Posted to hdfs-user@hadoop.apache.org by Anfernee Xu <an...@gmail.com> on 2014/09/07 18:37:02 UTC

In Yarn how to increase the number of concurrent applications for a queue

Hi,

I'm running my cluster on Hadoop 2.2.0 with the CapacityScheduler. All my
jobs are uberized and run in 2 queues: one queue takes the majority of the
capacity (90%) and the other takes 10%. What I found is that on the small
queue only one job runs at a given time. I tried tweaking the properties
below, but no luck so far. Could you guys shed some light on this?

 <property>
    <name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
    <value>1.0</value>
    <description>
      Maximum percent of resources in the cluster which can be used to run
      application masters i.e. controls number of concurrent running
      applications.
    </description>
  </property>


 <property>
    <name>yarn.scheduler.capacity.root.queues</name>
    <value>default,small</value>
    <description>
      The queues at this level (root is the root queue).
    </description>
  </property>

 <property>

<name>yarn.scheduler.capacity.root.small.maximum-am-resource-percent</name>
    <value>1.0</value>
  </property>


 <property>
    <name>yarn.scheduler.capacity.root.small.user-limit</name>
    <value>1</value>
  </property>

 <property>
    <name>yarn.scheduler.capacity.root.default.capacity</name>
    <value>88</value>
    <description>Default queue target capacity.</description>
  </property>


  <property>
    <name>yarn.scheduler.capacity.root.small.capacity</name>
    <value>12</value>
    <description>Small queue target capacity.</description>
  </property>

 <property>
    <name>yarn.scheduler.capacity.root.default.maximum-capacity</name>
    <value>88</value>
    <description>
      The maximum capacity of the default queue.
    </description>
  </property>

  <property>
    <name>yarn.scheduler.capacity.root.small.maximum-capacity</name>
    <value>12</value>
    <description>Maximum queue capacity.</description>
  </property>


Thanks

-- 
--Anfernee

Re: In Yarn how to increase the number of concurrent applications for a queue

Posted by Anfernee Xu <an...@gmail.com>.
Sure, I can open a JIRA, but how do I do it? I went to

https://issues.apache.org/jira/browse/YARN/?selectedTab=com.atlassian.jira.jira-projects-plugin:summary-panel

but I did not see any link that would let me open a new JIRA. Am I missing
something?

BTW, I found another interesting issue. All our jobs are uberized and we have
2 queues (default and small). Jobs on the default queue are fine, but jobs on
the small queue run slowly compared to the default queue. The major difference
is the time spent in the job commit: as you can see from the log below, the
user logic finished at 05:28:06,984, then the task kept waiting for about 21
seconds, and only at 05:28:27,036 was the job allowed to commit, whereas on
the default queue this takes less than 1 second.

Do you have any idea what could cause this? Is it due to the restricted
resources (the small queue only has 10 nodes whereas default has 100 nodes)?


2014-09-09 05:28:06,984 INFO [job-thread-8283272023] Job is Done
2014-09-09 05:28:06,985 INFO [uber-SubtaskRunner]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Status update from
attempt_1410195300700_18702_m_000000_0
2014-09-09 05:28:06,987 INFO [uber-SubtaskRunner]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt
attempt_1410195300700_18702_m_000000_0 is : 0.0
2014-09-09 05:28:07,004 INFO [uber-SubtaskRunner]
org.apache.hadoop.mapred.Task: Task:attempt_1410195300700_18702_m_000000_0
is done. And is in the process of committing
2014-09-09 05:28:07,028 INFO [uber-SubtaskRunner]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Commit-pending state
update from attempt_1410195300700_18702_m_000000_0
2014-09-09 05:28:07,029 INFO [AsyncDispatcher event handler]
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl:
attempt_1410195300700_18702_m_000000_0 TaskAttempt Transitioned from
RUNNING to COMMIT_PENDING
2014-09-09 05:28:07,029 INFO [uber-SubtaskRunner]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Commit go/no-go request
from attempt_1410195300700_18702_m_000000_0
2014-09-09 05:28:07,029 INFO [AsyncDispatcher event handler]
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl:
attempt_1410195300700_18702_m_000000_0 given a go for committing the task
output.
2014-09-09 05:28:08,029 INFO [uber-SubtaskRunner]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Commit go/no-go request
from attempt_1410195300700_18702_m_000000_0
2014-09-09 05:28:09,030 INFO [uber-SubtaskRunner]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Commit go/no-go request
from attempt_1410195300700_18702_m_000000_0
2014-09-09 05:28:09,968 INFO [communication thread]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Status update from
attempt_1410195300700_18702_m_000000_0
2014-09-09 05:28:09,968 INFO [communication thread]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt
attempt_1410195300700_18702_m_000000_0 is : 1.0
2014-09-09 05:28:10,030 INFO [uber-SubtaskRunner]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Commit go/no-go request
from attempt_1410195300700_18702_m_000000_0
2014-09-09 05:28:11,030 INFO [uber-SubtaskRunner]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Commit go/no-go request
from attempt_1410195300700_18702_m_000000_0
2014-09-09 05:28:12,031 INFO [uber-SubtaskRunner]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Commit go/no-go request
from attempt_1410195300700_18702_m_000000_0
2014-09-09 05:28:12,986 INFO [communication thread]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Status update from
attempt_1410195300700_18702_m_000000_0
2014-09-09 05:28:12,986 INFO [communication thread]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt
attempt_1410195300700_18702_m_000000_0 is : 1.0
2014-09-09 05:28:13,031 INFO [uber-SubtaskRunner]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Commit go/no-go request
from attempt_1410195300700_18702_m_000000_0
2014-09-09 05:28:14,031 INFO [uber-SubtaskRunner]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Commit go/no-go request
from attempt_1410195300700_18702_m_000000_0
2014-09-09 05:28:15,032 INFO [uber-SubtaskRunner]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Commit go/no-go request
from attempt_1410195300700_18702_m_000000_0
2014-09-09 05:28:16,001 INFO [communication thread]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Status update from
attempt_1410195300700_18702_m_000000_0
2014-09-09 05:28:16,002 INFO [communication thread]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt
attempt_1410195300700_18702_m_000000_0 is : 1.0
2014-09-09 05:28:16,032 INFO [uber-SubtaskRunner]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Commit go/no-go request
from attempt_1410195300700_18702_m_000000_0
2014-09-09 05:28:17,033 INFO [uber-SubtaskRunner]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Commit go/no-go request
from attempt_1410195300700_18702_m_000000_0
2014-09-09 05:28:18,033 INFO [uber-SubtaskRunner]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Commit go/no-go request
from attempt_1410195300700_18702_m_000000_0
2014-09-09 05:28:19,019 INFO [communication thread]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Status update from
attempt_1410195300700_18702_m_000000_0
2014-09-09 05:28:19,019 INFO [communication thread]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt
attempt_1410195300700_18702_m_000000_0 is : 1.0
2014-09-09 05:28:19,034 INFO [uber-SubtaskRunner]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Commit go/no-go request
from attempt_1410195300700_18702_m_000000_0
2014-09-09 05:28:20,034 INFO [uber-SubtaskRunner]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Commit go/no-go request
from attempt_1410195300700_18702_m_000000_0
2014-09-09 05:28:21,034 INFO [uber-SubtaskRunner]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Commit go/no-go request
from attempt_1410195300700_18702_m_000000_0
2014-09-09 05:28:22,034 INFO [communication thread]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Status update from
attempt_1410195300700_18702_m_000000_0
2014-09-09 05:28:22,034 INFO [communication thread]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt
attempt_1410195300700_18702_m_000000_0 is : 1.0
2014-09-09 05:28:22,034 INFO [uber-SubtaskRunner]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Commit go/no-go request
from attempt_1410195300700_18702_m_000000_0
2014-09-09 05:28:23,035 INFO [uber-SubtaskRunner]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Commit go/no-go request
from attempt_1410195300700_18702_m_000000_0
2014-09-09 05:28:24,035 INFO [uber-SubtaskRunner]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Commit go/no-go request
from attempt_1410195300700_18702_m_000000_0
2014-09-09 05:28:25,035 INFO [uber-SubtaskRunner]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Commit go/no-go request
from attempt_1410195300700_18702_m_000000_0
2014-09-09 05:28:25,049 INFO [communication thread]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Status update from
attempt_1410195300700_18702_m_000000_0
2014-09-09 05:28:25,049 INFO [communication thread]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt
attempt_1410195300700_18702_m_000000_0 is : 1.0
2014-09-09 05:28:26,035 INFO [uber-SubtaskRunner]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Commit go/no-go request
from attempt_1410195300700_18702_m_000000_0
2014-09-09 05:28:27,036 INFO [uber-SubtaskRunner]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Commit go/no-go request
from attempt_1410195300700_18702_m_000000_0
2014-09-09 05:28:27,036 INFO [uber-SubtaskRunner]
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: Result of canCommit
for attempt_1410195300700_18702_m_000000_0:true
2014-09-09 05:28:27,036 INFO [uber-SubtaskRunner]
org.apache.hadoop.mapred.Task: Task attempt_1410195300700_18702_m_000000_0
is allowed to commit now
2014-09-09 05:28:27,088 INFO [uber-SubtaskRunner]
org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter: Saved output of
task 'attempt_1410195300700_18702_m_000000_0' to hdfs://
slc02knk.us.oracle.com:55310/tmp/thirdeye/Publish-28305282698003/_temporary/1/task_1410195300700_18702_m_000000
2014-09-09 05:28:27,104 INFO [uber-SubtaskRunner]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Status update from
attempt_1410195300700_18702_m_000000_0
2014-09-09 05:28:27,104 INFO [uber-SubtaskRunner]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt
attempt_1410195300700_18702_m_000000_0 is : 1.0
2014-09-09 05:28:27,105 INFO [uber-SubtaskRunner]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Done acknowledgement from
attempt_1410195300700_18702_m_000000_0
2014-09-09 05:28:27,105 INFO [uber-SubtaskRunner]
org.apache.hadoop.mapred.Task: Task
'attempt_1410195300700_18702_m_000000_0' done.
2014-09-09 05:28:27,107 INFO [AsyncDispatcher event handler]
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl:
attempt_1410195300700_18702_m_000000_0 TaskAttempt Transitioned from
COMMIT_PENDING to SUCCESS_CONTAINER_CLEANUP
2014-09-09 05:28:27,107 INFO [uber-SubtaskRunner]
org.apache.hadoop.mapred.LocalContainerLauncher: Processing the event
EventType: CONTAINER_REMOTE_CLEANUP for container
container_1410195300700_18702_01_000001 taskAttempt
attempt_1410195300700_18702_m_000000_0
2014-09-09 05:28:27,111 INFO [AsyncDispatcher event handler]
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl:
attempt_1410195300700_18702_m_000000_0 TaskAttempt Transitioned from
SUCCESS_CONTAINER_CLEANUP to SUCCEEDED
2014-09-09 05:28:27,124 INFO [AsyncDispatcher event handler]
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: Task succeeded with
attempt attempt_1410195300700_18702_m_000000_0
2014-09-09 05:28:27,126 INFO [AsyncDispatcher event handler]
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl:
task_1410195300700_18702_m_000000 Task Transitioned from RUNNING to
SUCCEEDED
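
For what it's worth, the repeated "Commit go/no-go request" lines above appear
to be the uberized task asking the application master, roughly once per second,
whether it may commit; the 21 seconds are spent in that polling loop until
canCommit finally returns true. A rough sketch of the pattern, using
hypothetical names rather than the actual Hadoop classes:

// Illustration only: "AppMasterUmbilical" is a stand-in for the real
// task-to-AM RPC interface, not an actual Hadoop class.
interface AppMasterUmbilical {
  boolean canCommit(String taskAttemptId) throws Exception;
}

final class CommitPollSketch {
  // After the user logic finishes, the task polls the AM about once per second
  // ("Commit go/no-go request" in the log) until it gets the go-ahead.
  static void waitForCommitGoAhead(AppMasterUmbilical umbilical, String attemptId)
      throws Exception {
    while (!umbilical.canCommit(attemptId)) {
      Thread.sleep(1000);  // matches the ~1 second spacing of the log lines above
    }
    // Only now is the output committed ("Task ... is allowed to commit now",
    // followed by FileOutputCommitter saving the task output).
  }
}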


On Tue, Sep 9, 2014 at 10:58 AM, Arun Murthy <ac...@hortonworks.com> wrote:

> Thanks for digging into this. Mind opening a jira to discuss further? Much
> appreciated.
>
> Arun
>
> On Mon, Sep 8, 2014 at 7:15 PM, Anfernee Xu <an...@gmail.com> wrote:
>
>> It turned out that it's not a configuration issue: a worker thread that
>> submits jobs to Yarn was blocked, see the thread dump below.
>>
>> "pool-1-thread-160" id=194 idx=0x30c tid=886 prio=5 alive, blocked,
>> native_blocked
>>     -- Blocked trying to get lock:
>> org/apache/hadoop/ipc/Client$Connection@0x1059d0c60[thin lock]
>>     at __lll_lock_wait+36(:0)@0x340260d594
>>     at tsSleep+399(threadsystem.c:83)@0x2b2356e5da80
>>     at jrockit/vm/Threads.sleep(I)V(Native Method)
>>     at jrockit/vm/Locks.waitForThinRelease(Locks.java:955)[optimized]
>>     at
>> jrockit/vm/Locks.monitorEnterSecondStageHard(Locks.java:1083)[optimized]
>>     at
>> jrockit/vm/Locks.monitorEnterSecondStage(Locks.java:1005)[optimized]
>>     at
>> org/apache/hadoop/ipc/Client$Connection.addCall(Client.java:400)[inlined]
>>     at
>> org/apache/hadoop/ipc/Client$Connection.access$2500(Client.java:314)[inlined]
>>     at
>> org/apache/hadoop/ipc/Client.getConnection(Client.java:1393)[optimized]
>>     at org/apache/hadoop/ipc/Client.call(Client.java:1318)[inlined]
>>     at org/apache/hadoop/ipc/Client.call(Client.java:1300)[inlined]
>>     at
>> org/apache/hadoop/ipc/ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)[optimized]
>>     at
>> $Proxy21.getJobReport(Lcom/google/protobuf/RpcController;Lorg/apache/hadoop/mapreduce/v2/proto/MRServiceProtos$GetJobReportRequestProto;)Lorg/apache/hadoop/mapreduce/v2/proto/MRServiceProtos$GetJobReportResponseProto;(Unknown
>> Source)
>>     at
>> org/apache/hadoop/mapreduce/v2/api/impl/pb/client/MRClientProtocolPBClientImpl.getJobReport(MRClientProtocolPBClientImpl.java:133)[optimized]
>>     at
>> sun/reflect/GeneratedMethodAccessor79.invoke(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object;(Unknown
>> Source)[optimized]
>>     at
>> sun/reflect/DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)[optimized]
>>     at java/lang/reflect/Method.invoke(Method.java:597)[inlined]
>>     at
>> org/apache/hadoop/mapred/ClientServiceDelegate.invoke(ClientServiceDelegate.java:317)[inlined]
>>     at
>> org/apache/hadoop/mapred/ClientServiceDelegate.getJobStatus(ClientServiceDelegate.java:416)[optimized]
>>     ^-- Holding lock:
>> org/apache/hadoop/mapred/ClientServiceDelegate@0x10087d788[biased lock]
>>     at
>> org/apache/hadoop/mapred/YarnRunner.getJobStatus(TIEYarnRunner.java:522)[optimized]
>>     at org/apache/hadoop/mapreduce/Job$1.run(Job.java:314)[inlined]
>>     at org/apache/hadoop/mapreduce/Job$1.run(Job.java:311)[inlined]
>>     at
>> jrockit/vm/AccessController.doPrivileged(AccessController.java:254)[inlined]
>>     at javax/security/auth/Subject.doAs(Subject.java:396)[inlined]
>>     at
>> org/apache/hadoop/security/UserGroupInformation.doAs(UserGroupInformation.java:1491)[inlined]
>>     at
>> org/apache/hadoop/mapreduce/Job.updateStatus(Job.java:311)[optimized]
>>     ^-- Holding lock: org/apache/hadoop/mapreduce/Job@0x100522fb8[biased
>> lock]
>>     at org/apache/hadoop/mapreduce/Job.isComplete(Job.java:599)
>>     at org/apache/hadoop/mapreduce/Job.waitForCompletion(Job.java:1294)
>>
>> The lock was held by
>>
>> "pool-1-thread-10" id=44 idx=0xb4 tid=736 prio=5 alive, sleeping,
>> native_waiting
>>     at pthread_cond_timedwait@@GLIBC_2.3.2+288(:0)@0x340260b1c0
>>     at eventTimedWaitNoTransitionImpl+46(event.c:93)@0x2b2356cc741f
>>     at
>> syncWaitForSignalNoTransition+133(synchronization.c:51)@0x2b2356e5a096
>>     at syncWaitForSignal+189(synchronization.c:85)@0x2b2356e5a1ae
>>     at vmtSleep+165(signaling.c:197)@0x2b2356e35ef6
>>     at JVM_Sleep+188(jvmthreads.c:119)@0x2b2356d6bb7d
>>     at java/lang/Thread.sleep(J)V(Native Method)
>>     at
>> org/apache/hadoop/ipc/Client$Connection.handleConnectionFailure(Client.java:778)[optimized]
>>     at
>> org/apache/hadoop/ipc/Client$Connection.setupConnection(Client.java:566)[optimized]
>>     ^-- Holding lock: org/apache/hadoop/ipc/Client$Connection@0x1059d0c60
>> [recursive]
>>     at
>> org/apache/hadoop/ipc/Client$Connection.setupIOstreams(Client.java:642)[optimized]
>>     ^-- Holding lock: org/apache/hadoop/ipc/Client$Connection@0x1059d0c60[thin
>> lock]
>>     at
>> org/apache/hadoop/ipc/Client$Connection.access$2600(Client.java:314)[inlined]
>>     at
>> org/apache/hadoop/ipc/Client.getConnection(Client.java:1399)[optimized]
>>     at org/apache/hadoop/ipc/Client.call(Client.java:1318)[inlined]
>>     at org/apache/hadoop/ipc/Client.call(Client.java:1300)[inlined]
>>     at
>> org/apache/hadoop/ipc/ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)[optimized]
>>     at
>> $Proxy21.getJobReport(Lcom/google/protobuf/RpcController;Lorg/apache/hadoop/mapreduce/v2/proto/MRServiceProtos$GetJobReportRequestProto;)Lorg/apache/hadoop/mapreduce/v2/proto/MRServiceProtos$GetJobReportResponseProto;(Unknown
>> Source)
>>     at
>> org/apache/hadoop/mapreduce/v2/api/impl/pb/client/MRClientProtocolPBClientImpl.getJobReport(MRClientProtocolPBClientImpl.java:133)[optimized]
>>     at
>> sun/reflect/GeneratedMethodAccessor79.invoke(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object;(Unknown
>> Source)[optimized]
>>     at
>> sun/reflect/DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)[optimized]
>>     at java/lang/reflect/Method.invoke(Method.java:597)[inlined]
>>     at
>> org/apache/hadoop/mapred/ClientServiceDelegate.invoke(ClientServiceDelegate.java:317)[inlined]
>>     at
>> org/apache/hadoop/mapred/ClientServiceDelegate.getJobStatus(ClientServiceDelegate.java:416)[optimized]
>>     ^-- Holding lock:
>> org/apache/hadoop/mapred/ClientServiceDelegate@0x1012c34f8[biased lock]
>>     at
>> org/apache/hadoop/mapred/YarnRunner.getJobStatus(TIEYarnRunner.java:522)[optimized]
>>     at org/apache/hadoop/mapreduce/Job$1.run(Job.java:314)[inlined]
>>     at org/apache/hadoop/mapreduce/Job$1.run(Job.java:311)[inlined]
>>     at
>> jrockit/vm/AccessController.doPrivileged(AccessController.java:254)[inlined]
>>     at javax/security/auth/Subject.doAs(Subject.java:396)[inlined]
>>     at
>> org/apache/hadoop/security/UserGroupInformation.doAs(UserGroupInformation.java:1491)[inlined]
>>     at
>> org/apache/hadoop/mapreduce/Job.updateStatus(Job.java:311)[optimized]
>>     ^-- Holding lock: org/apache/hadoop/mapreduce/Job@0x1016e05a8[biased
>> lock]
>>     at org/apache/hadoop/mapreduce/Job.isComplete(Job.java:599)
>>     at org/apache/hadoop/mapreduce/Job.waitForCompletion(Job.java:1294)
>>
>> You can see the thread holding the lock is in a sleep state and the calling
>> method is Connection.handleConnectionFailure(), so I checked our log file
>> and realized the connection failure is because the historyserver is not
>> available. In my case, I did not start the historyserver at all, because it's
>> not needed (I disabled log aggregation), so my question is why the job
>> client was still trying to talk to the historyserver even though log
>> aggregation is disabled.
>>
>> Thanks
>>
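
The pattern visible in the two stack traces above (one thread retrying a
failed connection while holding the Client$Connection lock, another thread
blocked waiting for that same lock) can be sketched roughly like this, with
hypothetical names rather than the actual org.apache.hadoop.ipc.Client code:

// Rough sketch of the blocking pattern in the thread dumps above: connection
// setup retries inside a synchronized method, so while it sleeps between
// attempts every other caller queues up behind the same lock.
final class ConnectionRetrySketch {
  private boolean connected = false;

  synchronized void setupIOstreams() throws InterruptedException {
    while (!connected) {   // bounded by retry settings in the real IPC client
      try {
        connect();         // keeps failing while the history server is down
        connected = true;
      } catch (Exception connectionFailure) {
        // handleConnectionFailure(): sleep and retry, still holding this lock,
        // which is why the other thread sits in "Blocked trying to get lock".
        Thread.sleep(1000);
      }
    }
  }

  synchronized void addCall() {
    // A second caller (pool-1-thread-160 above) blocks here until
    // setupIOstreams() releases the lock.
  }

  private void connect() throws Exception {
    throw new Exception("Connection refused: no history server running");
  }
}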
>>
>>
>> On Mon, Sep 8, 2014 at 3:57 AM, Arun Murthy <ac...@hortonworks.com> wrote:
>>
>>> How many nodes do you have in your cluster?
>>>
>>> Also, could you share the CapacityScheduler initialization logs for each
>>> queue, such as:
>>>
>>> 2014-08-14 15:14:23,835 INFO
>>> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
>>> Initialized queue: unfunded: capacity=0.5, absoluteCapacity=0.5,
>>> usedResources=<memory:0, vCores:0>, usedCapacity=0.0,
>>> absoluteUsedCapacity=0.0, numApps=0, numContainers=0
>>> 2014-08-14 15:14:23,840 INFO
>>> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue:
>>> Initializing default
>>> capacity = 0.5 [= (float) configuredCapacity / 100 ]
>>> asboluteCapacity = 0.5 [= parentAbsoluteCapacity * capacity ]
>>> maxCapacity = 1.0 [= configuredMaxCapacity ]
>>> absoluteMaxCapacity = 1.0 [= 1.0 maximumCapacity undefined,
>>> (parentAbsoluteMaxCapacity * maximumCapacity) / 100 otherwise ]
>>> userLimit = 100 [= configuredUserLimit ]
>>> userLimitFactor = 1.0 [= configuredUserLimitFactor ]
>>> maxApplications = 5000 [= configuredMaximumSystemApplicationsPerQueue or
>>> (int)(configuredMaximumSystemApplications * absoluteCapacity)]
>>> maxApplicationsPerUser = 5000 [= (int)(maxApplications * (userLimit /
>>> 100.0f) * userLimitFactor) ]
>>> maxActiveApplications = 1 [= max((int)ceil((clusterResourceMemory /
>>> minimumAllocation) * maxAMResourcePerQueuePercent * absoluteMaxCapacity),1)
>>> ]
>>> maxActiveAppsUsingAbsCap = 1 [= max((int)ceil((clusterResourceMemory /
>>> minimumAllocation) *maxAMResourcePercent * absoluteCapacity),1) ]
>>> maxActiveApplicationsPerUser = 1 [= max((int)(maxActiveApplications *
>>> (userLimit / 100.0f) * userLimitFactor),1) ]
>>> usedCapacity = 0.0 [= usedResourcesMemory / (clusterResourceMemory *
>>> absoluteCapacity)]
>>> absoluteUsedCapacity = 0.0 [= usedResourcesMemory /
>>> clusterResourceMemory]
>>> maxAMResourcePerQueuePercent = 0.1 [= configuredMaximumAMResourcePercent
>>> ]
>>> minimumAllocationFactor = 0.87506104 [= (float)(maximumAllocationMemory
>>> - minimumAllocationMemory) / maximumAllocationMemory ]
>>> numContainers = 0 [= currentNumContainers ]
>>> state = RUNNING [= configuredState ]
>>> acls = SUBMIT_APPLICATIONS: ADMINISTER_QUEUE:  [= configuredAcls ]
>>> nodeLocalityDelay = 0
>>>
>>>
>>> Then, look at values for maxActiveAppsUsingAbsCap &
>>> maxActiveApplicationsPerUser. That should help debugging.
>>>
>>> thanks,
>>> Arun
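
To make the formulas in the log above concrete, here is a small worked example
with made-up numbers (the 80 GB of cluster memory, 1 GB minimum allocation,
0.1 AM-resource percent and 12% queue capacity are illustrative assumptions,
not figures from this thread). With values in that range the product rounds to
a single active application, which would match seeing only one running job on
the small queue:

// Worked example with hypothetical numbers, mirroring the maxActiveApplications
// formula printed in the LeafQueue initialization log quoted above.
public class MaxActiveAppsExample {
  public static void main(String[] args) {
    int clusterResourceMemory = 80 * 1024;      // assume 80 GB of cluster memory, in MB
    int minimumAllocation = 1024;               // assume a 1 GB minimum allocation
    double maxAMResourcePerQueuePercent = 0.1;  // the 0.1 default shown in the log
    double absoluteMaxCapacity = 0.12;          // root.small.maximum-capacity = 12%

    int maxActiveApplications = Math.max(
        (int) Math.ceil((clusterResourceMemory / (double) minimumAllocation)
            * maxAMResourcePerQueuePercent * absoluteMaxCapacity),
        1);

    // ceil(80 * 0.1 * 0.12) = ceil(0.96) = 1, i.e. at most one application
    // master can be active in this queue at a time with these numbers.
    System.out.println("maxActiveApplications = " + maxActiveApplications);
  }
}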
>>>
>>>
>>> On Sun, Sep 7, 2014 at 9:37 AM, Anfernee Xu <an...@gmail.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> I'm running my cluster on Hadoop 2.2.0 with the CapacityScheduler. All my
>>>> jobs are uberized and run in 2 queues: one queue takes the majority of the
>>>> capacity (90%) and the other takes 10%. What I found is that on the small
>>>> queue only one job runs at a given time. I tried tweaking the properties
>>>> below, but no luck so far. Could you guys shed some light on this?
>>>>
>>>>  <property>
>>>>     <name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
>>>>     <value>1.0</value>
>>>>     <description>
>>>>       Maximum percent of resources in the cluster which can be used to
>>>> run
>>>>       application masters i.e. controls number of concurrent running
>>>>       applications.
>>>>     </description>
>>>>   </property>
>>>>
>>>>
>>>>  <property>
>>>>     <name>yarn.scheduler.capacity.root.queues</name>
>>>>     <value>default,small</value>
>>>>     <description>
>>>>       The queues at this level (root is the root queue).
>>>>     </description>
>>>>   </property>
>>>>
>>>>  <property>
>>>>
>>>> <name>yarn.scheduler.capacity.root.small.maximum-am-resource-percent</name>
>>>>     <value>1.0</value>
>>>>   </property>
>>>>
>>>>
>>>>  <property>
>>>>     <name>yarn.scheduler.capacity.root.small.user-limit</name>
>>>>     <value>1</value>
>>>>   </property>
>>>>
>>>>  <property>
>>>>     <name>yarn.scheduler.capacity.root.default.capacity</name>
>>>>     <value>88</value>
>>>>     <description>Default queue target capacity.</description>
>>>>   </property>
>>>>
>>>>
>>>>   <property>
>>>>     <name>yarn.scheduler.capacity.root.small.capacity</name>
>>>>     <value>12</value>
>>>>     <description>Small queue target capacity.</description>
>>>>   </property>
>>>>
>>>>  <property>
>>>>     <name>yarn.scheduler.capacity.root.default.maximum-capacity</name>
>>>>     <value>88</value>
>>>>     <description>
>>>>       The maximum capacity of the default queue.
>>>>     </description>
>>>>   </property>
>>>>
>>>>   <property>
>>>>     <name>yarn.scheduler.capacity.root.small.maximum-capacity</name>
>>>>     <value>12</value>
>>>>     <description>Maximum queue capacity.</description>
>>>>   </property>
>>>>
>>>>
>>>> Thanks
>>>>
>>>> --
>>>> --Anfernee
>>>>
>>>
>>>
>>>
>>> --
>>>
>>> --
>>> Arun C. Murthy
>>> Hortonworks Inc.
>>> http://hortonworks.com/
>>>
>>> CONFIDENTIALITY NOTICE
>>> NOTICE: This message is intended for the use of the individual or entity
>>> to which it is addressed and may contain information that is confidential,
>>> privileged and exempt from disclosure under applicable law. If the reader
>>> of this message is not the intended recipient, you are hereby notified that
>>> any printing, copying, dissemination, distribution, disclosure or
>>> forwarding of this communication is strictly prohibited. If you have
>>> received this communication in error, please contact the sender immediately
>>> and delete it from your system. Thank You.
>>
>>
>>
>>
>> --
>> --Anfernee
>>
>
>
>
> --
>
> --
> Arun C. Murthy
> Hortonworks Inc.
> http://hortonworks.com/
>
> CONFIDENTIALITY NOTICE
> NOTICE: This message is intended for the use of the individual or entity
> to which it is addressed and may contain information that is confidential,
> privileged and exempt from disclosure under applicable law. If the reader
> of this message is not the intended recipient, you are hereby notified that
> any printing, copying, dissemination, distribution, disclosure or
> forwarding of this communication is strictly prohibited. If you have
> received this communication in error, please contact the sender immediately
> and delete it from your system. Thank You.
>



-- 
--Anfernee

Re: In Yarn how to increase the number of concurrent applications for a queue

Posted by Anfernee Xu <an...@gmail.com>.
Sure, I can open a jira, but how can I do it? I went to

https://issues.apache.org/jira/browse/YARN/?selectedTab=com.atlassian.jira.jira-projects-plugin:summary-panel

But I did not see any link can lead me to open a new jira? do I miss
something?

BTW, I found another interesting issue, as all our jobs are uberized and we
have 2 queues(default and small), all jobs for default queue are fine, but
jobs on small queue ran slowly compared to default queue, the major
difference is the time spent in job commit, as you can see from below log,
the user logic was finished at 05:28:06,984,
and then it kept going for 21 seconds, and at 05:28:27,036 the job was
allowed to commit, whereas on default queue, it only takes less than 1
second for this.

Do you have any idea about what can cause this? Is it due to the restricted
resource(small queue only has 10 nodes whereas default has 100 nodes).


2014-09-09 05:28:06,984 INFO [job-thread-8283272023] Job is Done
2014-09-09 05:28:06,985 INFO [uber-SubtaskRunner]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Status update from
attempt_1410195300700_18702_m_000000_0
2014-09-09 05:28:06,987 INFO [uber-SubtaskRunner]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt
attempt_1410195300700_18702_m_000000_0 is : 0.0
2014-09-09 05:28:07,004 INFO [uber-SubtaskRunner]
org.apache.hadoop.mapred.Task: Task:attempt_1410195300700_18702_m_000000_0
is done. And is in the process of committing
2014-09-09 05:28:07,028 INFO [uber-SubtaskRunner]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Commit-pending state
update from attempt_1410195300700_18702_m_000000_0
2014-09-09 05:28:07,029 INFO [AsyncDispatcher event handler]
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl:
attempt_1410195300700_18702_m_000000_0 TaskAttempt Transitioned from
RUNNING to COMMIT_PENDING
2014-09-09 05:28:07,029 INFO [uber-SubtaskRunner]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Commit go/no-go request
from attempt_1410195300700_18702_m_000000_0
2014-09-09 05:28:07,029 INFO [AsyncDispatcher event handler]
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl:
attempt_1410195300700_18702_m_000000_0 given a go for committing the task
output.
2014-09-09 05:28:08,029 INFO [uber-SubtaskRunner]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Commit go/no-go request
from attempt_1410195300700_18702_m_000000_0
2014-09-09 05:28:09,030 INFO [uber-SubtaskRunner]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Commit go/no-go request
from attempt_1410195300700_18702_m_000000_0
2014-09-09 05:28:09,968 INFO [communication thread]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Status update from
attempt_1410195300700_18702_m_000000_0
2014-09-09 05:28:09,968 INFO [communication thread]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt
attempt_1410195300700_18702_m_000000_0 is : 1.0
2014-09-09 05:28:10,030 INFO [uber-SubtaskRunner]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Commit go/no-go request
from attempt_1410195300700_18702_m_000000_0
2014-09-09 05:28:11,030 INFO [uber-SubtaskRunner]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Commit go/no-go request
from attempt_1410195300700_18702_m_000000_0
2014-09-09 05:28:12,031 INFO [uber-SubtaskRunner]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Commit go/no-go request
from attempt_1410195300700_18702_m_000000_0
2014-09-09 05:28:12,986 INFO [communication thread]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Status update from
attempt_1410195300700_18702_m_000000_0
2014-09-09 05:28:12,986 INFO [communication thread]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt
attempt_1410195300700_18702_m_000000_0 is : 1.0
2014-09-09 05:28:13,031 INFO [uber-SubtaskRunner]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Commit go/no-go request
from attempt_1410195300700_18702_m_000000_0
2014-09-09 05:28:14,031 INFO [uber-SubtaskRunner]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Commit go/no-go request
from attempt_1410195300700_18702_m_000000_0
2014-09-09 05:28:15,032 INFO [uber-SubtaskRunner]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Commit go/no-go request
from attempt_1410195300700_18702_m_000000_0
2014-09-09 05:28:16,001 INFO [communication thread]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Status update from
attempt_1410195300700_18702_m_000000_0
2014-09-09 05:28:16,002 INFO [communication thread]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt
attempt_1410195300700_18702_m_000000_0 is : 1.0
2014-09-09 05:28:16,032 INFO [uber-SubtaskRunner]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Commit go/no-go request
from attempt_1410195300700_18702_m_000000_0
2014-09-09 05:28:17,033 INFO [uber-SubtaskRunner]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Commit go/no-go request
from attempt_1410195300700_18702_m_000000_0
2014-09-09 05:28:18,033 INFO [uber-SubtaskRunner]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Commit go/no-go request
from attempt_1410195300700_18702_m_000000_0
2014-09-09 05:28:19,019 INFO [communication thread]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Status update from
attempt_1410195300700_18702_m_000000_0
2014-09-09 05:28:19,019 INFO [communication thread]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt
attempt_1410195300700_18702_m_000000_0 is : 1.0
2014-09-09 05:28:19,034 INFO [uber-SubtaskRunner]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Commit go/no-go request
from attempt_1410195300700_18702_m_000000_0
2014-09-09 05:28:20,034 INFO [uber-SubtaskRunner]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Commit go/no-go request
from attempt_1410195300700_18702_m_000000_0
2014-09-09 05:28:21,034 INFO [uber-SubtaskRunner]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Commit go/no-go request
from attempt_1410195300700_18702_m_000000_0
2014-09-09 05:28:22,034 INFO [communication thread]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Status update from
attempt_1410195300700_18702_m_000000_0
2014-09-09 05:28:22,034 INFO [communication thread]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt
attempt_1410195300700_18702_m_000000_0 is : 1.0
2014-09-09 05:28:22,034 INFO [uber-SubtaskRunner]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Commit go/no-go request
from attempt_1410195300700_18702_m_000000_0
2014-09-09 05:28:23,035 INFO [uber-SubtaskRunner]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Commit go/no-go request
from attempt_1410195300700_18702_m_000000_0
2014-09-09 05:28:24,035 INFO [uber-SubtaskRunner]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Commit go/no-go request
from attempt_1410195300700_18702_m_000000_0
2014-09-09 05:28:25,035 INFO [uber-SubtaskRunner]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Commit go/no-go request
from attempt_1410195300700_18702_m_000000_0
2014-09-09 05:28:25,049 INFO [communication thread]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Status update from
attempt_1410195300700_18702_m_000000_0
2014-09-09 05:28:25,049 INFO [communication thread]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt
attempt_1410195300700_18702_m_000000_0 is : 1.0
2014-09-09 05:28:26,035 INFO [uber-SubtaskRunner]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Commit go/no-go request
from attempt_1410195300700_18702_m_000000_0
2014-09-09 05:28:27,036 INFO [uber-SubtaskRunner]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Commit go/no-go request
from attempt_1410195300700_18702_m_000000_0
2014-09-09 05:28:27,036 INFO [uber-SubtaskRunner]
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: Result of canCommit
for attempt_1410195300700_18702_m_000000_0:true
2014-09-09 05:28:27,036 INFO [uber-SubtaskRunner]
org.apache.hadoop.mapred.Task: Task attempt_1410195300700_18702_m_000000_0
is allowed to commit now
2014-09-09 05:28:27,088 INFO [uber-SubtaskRunner]
org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter: Saved output of
task 'attempt_1410195300700_18702_m_000000_0' to hdfs://
slc02knk.us.oracle.com:55310/tmp/thirdeye/Publish-28305282698003/_temporary/1/task_1410195300700_18702_m_000000
2014-09-09 05:28:27,104 INFO [uber-SubtaskRunner]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Status update from
attempt_1410195300700_18702_m_000000_0
2014-09-09 05:28:27,104 INFO [uber-SubtaskRunner]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt
attempt_1410195300700_18702_m_000000_0 is : 1.0
2014-09-09 05:28:27,105 INFO [uber-SubtaskRunner]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Done acknowledgement from
attempt_1410195300700_18702_m_000000_0
2014-09-09 05:28:27,105 INFO [uber-SubtaskRunner]
org.apache.hadoop.mapred.Task: Task
'attempt_1410195300700_18702_m_000000_0' done.
2014-09-09 05:28:27,107 INFO [AsyncDispatcher event handler]
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl:
attempt_1410195300700_18702_m_000000_0 TaskAttempt Transitioned from
COMMIT_PENDING to SUCCESS_CONTAINER_CLEANUP
2014-09-09 05:28:27,107 INFO [uber-SubtaskRunner]
org.apache.hadoop.mapred.LocalContainerLauncher: Processing the event
EventType: CONTAINER_REMOTE_CLEANUP for container
container_1410195300700_18702_01_000001 taskAttempt
attempt_1410195300700_18702_m_000000_0
2014-09-09 05:28:27,111 INFO [AsyncDispatcher event handler]
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl:
attempt_1410195300700_18702_m_000000_0 TaskAttempt Transitioned from
SUCCESS_CONTAINER_CLEANUP to SUCCEEDED
2014-09-09 05:28:27,124 INFO [AsyncDispatcher event handler]
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: Task succeeded with
attempt attempt_1410195300700_18702_m_000000_0
2014-09-09 05:28:27,126 INFO [AsyncDispatcher event handler]
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl:
task_1410195300700_18702_m_000000 Task Transitioned from RUNNING to
SUCCEEDED


On Tue, Sep 9, 2014 at 10:58 AM, Arun Murthy <ac...@hortonworks.com> wrote:

> Thanks for digging into this. Mind opening a jira to discuss further? Much
> appreciated.
>
> Arun
>
> On Mon, Sep 8, 2014 at 7:15 PM, Anfernee Xu <an...@gmail.com> wrote:
>
>> It turned out that it's not a configuration issue, some worker thread
>> which submits job to Yarn was blocked, see below thread dump
>>
>> "pool-1-thread-160" id=194 idx=0x30c tid=886 prio=5 alive, blocked,
>> native_blocked
>>     -- Blocked trying to get lock:
>> org/apache/hadoop/ipc/Client$Connection@0x1059d0c60[thin lock]
>>     at __lll_lock_wait+36(:0)@0x340260d594
>>     at tsSleep+399(threadsystem.c:83)@0x2b2356e5da80
>>     at jrockit/vm/Threads.sleep(I)V(Native Method)
>>     at jrockit/vm/Locks.waitForThinRelease(Locks.java:955)[optimized]
>>     at
>> jrockit/vm/Locks.monitorEnterSecondStageHard(Locks.java:1083)[optimized]
>>     at
>> jrockit/vm/Locks.monitorEnterSecondStage(Locks.java:1005)[optimized]
>>     at
>> org/apache/hadoop/ipc/Client$Connection.addCall(Client.java:400)[inlined]
>>     at
>> org/apache/hadoop/ipc/Client$Connection.access$2500(Client.java:314)[inlined]
>>     at
>> org/apache/hadoop/ipc/Client.getConnection(Client.java:1393)[optimized]
>>     at org/apache/hadoop/ipc/Client.call(Client.java:1318)[inlined]
>>     at org/apache/hadoop/ipc/Client.call(Client.java:1300)[inlined]
>>     at
>> org/apache/hadoop/ipc/ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)[optimized]
>>     at
>> $Proxy21.getJobReport(Lcom/google/protobuf/RpcController;Lorg/apache/hadoop/mapreduce/v2/proto/MRServiceProtos$GetJobReportRequestProto;)Lorg/apache/hadoop/mapreduce/v2/proto/MRServiceProtos$GetJobReportResponseProto;(Unknown
>> Source)
>>     at
>> org/apache/hadoop/mapreduce/v2/api/impl/pb/client/MRClientProtocolPBClientImpl.getJobReport(MRClientProtocolPBClientImpl.java:133)[optimized]
>>     at
>> sun/reflect/GeneratedMethodAccessor79.invoke(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object;(Unknown
>> Source)[optimized]
>>     at
>> sun/reflect/DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)[optimized]
>>     at java/lang/reflect/Method.invoke(Method.java:597)[inlined]
>>     at
>> org/apache/hadoop/mapred/ClientServiceDelegate.invoke(ClientServiceDelegate.java:317)[inlined]
>>     at
>> org/apache/hadoop/mapred/ClientServiceDelegate.getJobStatus(ClientServiceDelegate.java:416)[optimized]
>>     ^-- Holding lock:
>> org/apache/hadoop/mapred/ClientServiceDelegate@0x10087d788[biased lock]
>>     at
>> org/apache/hadoop/mapred/YarnRunner.getJobStatus(TIEYarnRunner.java:522)[optimized]
>>     at org/apache/hadoop/mapreduce/Job$1.run(Job.java:314)[inlined]
>>     at org/apache/hadoop/mapreduce/Job$1.run(Job.java:311)[inlined]
>>     at
>> jrockit/vm/AccessController.doPrivileged(AccessController.java:254)[inlined]
>>     at javax/security/auth/Subject.doAs(Subject.java:396)[inlined]
>>     at
>> org/apache/hadoop/security/UserGroupInformation.doAs(UserGroupInformation.java:1491)[inlined]
>>     at
>> org/apache/hadoop/mapreduce/Job.updateStatus(Job.java:311)[optimized]
>>     ^-- Holding lock: org/apache/hadoop/mapreduce/Job@0x100522fb8[biased
>> lock]
>>     at org/apache/hadoop/mapreduce/Job.isComplete(Job.java:599)
>>     at org/apache/hadoop/mapreduce/Job.waitForCompletion(Job.java:1294)
>>
>> The lock was held by
>>
>> "pool-1-thread-10" id=44 idx=0xb4 tid=736 prio=5 alive, sleeping,
>> native_waiting
>>     at pthread_cond_timedwait@@GLIBC_2.3.2+288(:0)@0x340260b1c0
>>     at eventTimedWaitNoTransitionImpl+46(event.c:93)@0x2b2356cc741f
>>     at
>> syncWaitForSignalNoTransition+133(synchronization.c:51)@0x2b2356e5a096
>>     at syncWaitForSignal+189(synchronization.c:85)@0x2b2356e5a1ae
>>     at vmtSleep+165(signaling.c:197)@0x2b2356e35ef6
>>     at JVM_Sleep+188(jvmthreads.c:119)@0x2b2356d6bb7d
>>     at java/lang/Thread.sleep(J)V(Native Method)
>>     at
>> org/apache/hadoop/ipc/Client$Connection.handleConnectionFailure(Client.java:778)[optimized]
>>     at
>> org/apache/hadoop/ipc/Client$Connection.setupConnection(Client.java:566)[optimized]
>>     ^-- Holding lock: org/apache/hadoop/ipc/Client$Connection@0x1059d0c60
>> [recursive]
>>     at
>> org/apache/hadoop/ipc/Client$Connection.setupIOstreams(Client.java:642)[optimized]
>>     ^-- Holding lock: org/apache/hadoop/ipc/Client$Connection@0x1059d0c60[thin
>> lock]
>>     at
>> org/apache/hadoop/ipc/Client$Connection.access$2600(Client.java:314)[inlined]
>>     at
>> org/apache/hadoop/ipc/Client.getConnection(Client.java:1399)[optimized]
>>     at org/apache/hadoop/ipc/Client.call(Client.java:1318)[inlined]
>>     at org/apache/hadoop/ipc/Client.call(Client.java:1300)[inlined]
>>     at
>> org/apache/hadoop/ipc/ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)[optimized]
>>     at
>> $Proxy21.getJobReport(Lcom/google/protobuf/RpcController;Lorg/apache/hadoop/mapreduce/v2/proto/MRServiceProtos$GetJobReportRequestProto;)Lorg/apache/hadoop/mapreduce/v2/proto/MRServiceProtos$GetJobReportResponseProto;(Unknown
>> Source)
>>     at
>> org/apache/hadoop/mapreduce/v2/api/impl/pb/client/MRClientProtocolPBClientImpl.getJobReport(MRClientProtocolPBClientImpl.java:133)[optimized]
>>     at
>> sun/reflect/GeneratedMethodAccessor79.invoke(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object;(Unknown
>> Source)[optimized]
>>     at
>> sun/reflect/DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)[optimized]
>>     at java/lang/reflect/Method.invoke(Method.java:597)[inlined]
>>     at
>> org/apache/hadoop/mapred/ClientServiceDelegate.invoke(ClientServiceDelegate.java:317)[inlined]
>>     at
>> org/apache/hadoop/mapred/ClientServiceDelegate.getJobStatus(ClientServiceDelegate.java:416)[optimized]
>>     ^-- Holding lock:
>> org/apache/hadoop/mapred/ClientServiceDelegate@0x1012c34f8[biased lock]
>>     at
>> org/apache/hadoop/mapred/YarnRunner.getJobStatus(TIEYarnRunner.java:522)[optimized]
>>     at org/apache/hadoop/mapreduce/Job$1.run(Job.java:314)[inlined]
>>     at org/apache/hadoop/mapreduce/Job$1.run(Job.java:311)[inlined]
>>     at
>> jrockit/vm/AccessController.doPrivileged(AccessController.java:254)[inlined]
>>     at javax/security/auth/Subject.doAs(Subject.java:396)[inlined]
>>     at
>> org/apache/hadoop/security/UserGroupInformation.doAs(UserGroupInformation.java:1491)[inlined]
>>     at
>> org/apache/hadoop/mapreduce/Job.updateStatus(Job.java:311)[optimized]
>>     ^-- Holding lock: org/apache/hadoop/mapreduce/Job@0x1016e05a8[biased
>> lock]
>>     at org/apache/hadoop/mapreduce/Job.isComplete(Job.java:599)
>>     at org/apache/hadoop/mapreduce/Job.waitForCompletion(Job.java:1294)
>>
>> You can see the thead holding the lock is in sleep state and the calling
>> method is Connection.handleConnectionFailure(), so I checked the our log
>> file and realized the connection failure is about historyserver is not
>> available. In my case, I did not start historyserver at all, because it's
>> not needed(I disabled log-aggregation), so my question is why the job
>> client was still trying to talk to historyserver even log aggregation is
>> disabled.
>>
>> Thanks
>>
>>
>>
>> On Mon, Sep 8, 2014 at 3:57 AM, Arun Murthy <ac...@hortonworks.com> wrote:
>>
>>> How many nodes do you have in your cluster?
>>>
>>> Also, could you share the CapacityScheduler initialization logs for each
>>> queue, such as:
>>>
>>> 2014-08-14 15:14:23,835 INFO
>>> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
>>> Initialized queue: unfunded: capacity=0.5, absoluteCapacity=0.5,
>>> usedResources=<memory:0, vCores:0>, usedCapacity=0.0,
>>> absoluteUsedCapacity=0.0, numApps=0, numContainers=0
>>> 2014-08-14 15:14:23,840 INFO
>>> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue:
>>> Initializing default
>>> capacity = 0.5 [= (float) configuredCapacity / 100 ]
>>> asboluteCapacity = 0.5 [= parentAbsoluteCapacity * capacity ]
>>> maxCapacity = 1.0 [= configuredMaxCapacity ]
>>> absoluteMaxCapacity = 1.0 [= 1.0 maximumCapacity undefined,
>>> (parentAbsoluteMaxCapacity * maximumCapacity) / 100 otherwise ]
>>> userLimit = 100 [= configuredUserLimit ]
>>> userLimitFactor = 1.0 [= configuredUserLimitFactor ]
>>> maxApplications = 5000 [= configuredMaximumSystemApplicationsPerQueue or
>>> (int)(configuredMaximumSystemApplications * absoluteCapacity)]
>>> maxApplicationsPerUser = 5000 [= (int)(maxApplications * (userLimit /
>>> 100.0f) * userLimitFactor) ]
>>> maxActiveApplications = 1 [= max((int)ceil((clusterResourceMemory /
>>> minimumAllocation) * maxAMResourcePerQueuePercent * absoluteMaxCapacity),1)
>>> ]
>>> maxActiveAppsUsingAbsCap = 1 [= max((int)ceil((clusterResourceMemory /
>>> minimumAllocation) *maxAMResourcePercent * absoluteCapacity),1) ]
>>> maxActiveApplicationsPerUser = 1 [= max((int)(maxActiveApplications *
>>> (userLimit / 100.0f) * userLimitFactor),1) ]
>>> usedCapacity = 0.0 [= usedResourcesMemory / (clusterResourceMemory *
>>> absoluteCapacity)]
>>> absoluteUsedCapacity = 0.0 [= usedResourcesMemory /
>>> clusterResourceMemory]
>>> maxAMResourcePerQueuePercent = 0.1 [= configuredMaximumAMResourcePercent
>>> ]
>>> minimumAllocationFactor = 0.87506104 [= (float)(maximumAllocationMemory
>>> - minimumAllocationMemory) / maximumAllocationMemory ]
>>> numContainers = 0 [= currentNumContainers ]
>>> state = RUNNING [= configuredState ]
>>> acls = SUBMIT_APPLICATIONS: ADMINISTER_QUEUE:  [= configuredAcls ]
>>> nodeLocalityDelay = 0
>>>
>>>
>>> Then, look at values for maxActiveAppsUsingAbsCap &
>>> maxActiveApplicationsPerUser. That should help debugging.
>>>
>>> thanks,
>>> Arun
>>>
>>>
>>> On Sun, Sep 7, 2014 at 9:37 AM, Anfernee Xu <an...@gmail.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> I'rm running my cluster at Hadoop 2.2.0,  and use CapacityScheduler.
>>>> And all my jobs are uberized and running among 2 queues, one queue takes
>>>> majority of capacity(90%), another take 10%. What I found is for small
>>>> queue, only one job is running for a given time, I tried twisting below
>>>> properties, but no luck so far, could you guys share some light on this?
>>>>
>>>>  <property>
>>>>     <name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
>>>>     <value>1.0</value>
>>>>     <description>
>>>>       Maximum percent of resources in the cluster which can be used to
>>>> run
>>>>       application masters i.e. controls number of concurrent running
>>>>       applications.
>>>>     </description>
>>>>   </property>
>>>>
>>>>
>>>>  <property>
>>>>     <name>yarn.scheduler.capacity.root.queues</name>
>>>>     <value>default,small</value>
>>>>     <description>
>>>>       The queues at the this level (root is the root queue).
>>>>     </description>
>>>>   </property>
>>>>
>>>>  <property>
>>>>
>>>> <name>yarn.scheduler.capacity.root.small.maximum-am-resource-percent</name>
>>>>     <value>1.0</value>
>>>>   </property>
>>>>
>>>>
>>>>  <property>
>>>>     <name>yarn.scheduler.capacity.root.small.user-limit</name>
>>>>     <value>1</value>
>>>>   </property>
>>>>
>>>>  <property>
>>>>     <name>yarn.scheduler.capacity.root.default.capacity</name>
>>>>     <value>88</value>
>>>>     <description>Default queue target capacity.</description>
>>>>   </property>
>>>>
>>>>
>>>>   <property>
>>>>     <name>yarn.scheduler.capacity.root.small.capacity</name>
>>>>     <value>12</value>
>>>>     <description>Default queue target capacity.</description>
>>>>   </property>
>>>>
>>>>  <property>
>>>>     <name>yarn.scheduler.capacity.root.default.maximum-capacity</name>
>>>>     <value>88</value>
>>>>     <description>
>>>>       The maximum capacity of the default queue.
>>>>     </description>
>>>>   </property>
>>>>
>>>>   <property>
>>>>     <name>yarn.scheduler.capacity.root.small.maximum-capacity</name>
>>>>     <value>12</value>
>>>>     <description>Maximum queue capacity.</description>
>>>>   </property>
>>>>
>>>>
>>>> Thanks
>>>>
>>>> --
>>>> --Anfernee
>>>>
>>>
>>>
>>>
>>> --
>>>
>>> --
>>> Arun C. Murthy
>>> Hortonworks Inc.
>>> http://hortonworks.com/
>>>
>>> CONFIDENTIALITY NOTICE
>>> NOTICE: This message is intended for the use of the individual or entity
>>> to which it is addressed and may contain information that is confidential,
>>> privileged and exempt from disclosure under applicable law. If the reader
>>> of this message is not the intended recipient, you are hereby notified that
>>> any printing, copying, dissemination, distribution, disclosure or
>>> forwarding of this communication is strictly prohibited. If you have
>>> received this communication in error, please contact the sender immediately
>>> and delete it from your system. Thank You.
>>
>>
>>
>>
>> --
>> --Anfernee
>>
>
>
>
> --
>
> --
> Arun C. Murthy
> Hortonworks Inc.
> http://hortonworks.com/
>
> CONFIDENTIALITY NOTICE
> NOTICE: This message is intended for the use of the individual or entity
> to which it is addressed and may contain information that is confidential,
> privileged and exempt from disclosure under applicable law. If the reader
> of this message is not the intended recipient, you are hereby notified that
> any printing, copying, dissemination, distribution, disclosure or
> forwarding of this communication is strictly prohibited. If you have
> received this communication in error, please contact the sender immediately
> and delete it from your system. Thank You.
>



-- 
--Anfernee

Re: In Yarn how to increase the number of concurrent applications for a queue

Posted by Anfernee Xu <an...@gmail.com>.
Sure, I can open a jira, but how can I do it? I went to

https://issues.apache.org/jira/browse/YARN/?selectedTab=com.atlassian.jira.jira-projects-plugin:summary-panel

But I did not see any link can lead me to open a new jira? do I miss
something?

BTW, I found another interesting issue, as all our jobs are uberized and we
have 2 queues(default and small), all jobs for default queue are fine, but
jobs on small queue ran slowly compared to default queue, the major
difference is the time spent in job commit, as you can see from below log,
the user logic was finished at 05:28:06,984,
and then it kept going for 21 seconds, and at 05:28:27,036 the job was
allowed to commit, whereas on default queue, it only takes less than 1
second for this.

Do you have any idea about what can cause this? Is it due to the restricted
resource(small queue only has 10 nodes whereas default has 100 nodes).


2014-09-09 05:28:06,984 INFO [job-thread-8283272023] Job is Done
2014-09-09 05:28:06,985 INFO [uber-SubtaskRunner]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Status update from
attempt_1410195300700_18702_m_000000_0
2014-09-09 05:28:06,987 INFO [uber-SubtaskRunner]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt
attempt_1410195300700_18702_m_000000_0 is : 0.0
2014-09-09 05:28:07,004 INFO [uber-SubtaskRunner]
org.apache.hadoop.mapred.Task: Task:attempt_1410195300700_18702_m_000000_0
is done. And is in the process of committing
2014-09-09 05:28:07,028 INFO [uber-SubtaskRunner]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Commit-pending state
update from attempt_1410195300700_18702_m_000000_0
2014-09-09 05:28:07,029 INFO [AsyncDispatcher event handler]
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl:
attempt_1410195300700_18702_m_000000_0 TaskAttempt Transitioned from
RUNNING to COMMIT_PENDING
2014-09-09 05:28:07,029 INFO [uber-SubtaskRunner]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Commit go/no-go request
from attempt_1410195300700_18702_m_000000_0
2014-09-09 05:28:07,029 INFO [AsyncDispatcher event handler]
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl:
attempt_1410195300700_18702_m_000000_0 given a go for committing the task
output.
2014-09-09 05:28:08,029 INFO [uber-SubtaskRunner]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Commit go/no-go request
from attempt_1410195300700_18702_m_000000_0
2014-09-09 05:28:09,030 INFO [uber-SubtaskRunner]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Commit go/no-go request
from attempt_1410195300700_18702_m_000000_0
2014-09-09 05:28:09,968 INFO [communication thread]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Status update from
attempt_1410195300700_18702_m_000000_0
2014-09-09 05:28:09,968 INFO [communication thread]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt
attempt_1410195300700_18702_m_000000_0 is : 1.0
2014-09-09 05:28:10,030 INFO [uber-SubtaskRunner]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Commit go/no-go request
from attempt_1410195300700_18702_m_000000_0
2014-09-09 05:28:11,030 INFO [uber-SubtaskRunner]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Commit go/no-go request
from attempt_1410195300700_18702_m_000000_0
2014-09-09 05:28:12,031 INFO [uber-SubtaskRunner]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Commit go/no-go request
from attempt_1410195300700_18702_m_000000_0
2014-09-09 05:28:12,986 INFO [communication thread]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Status update from
attempt_1410195300700_18702_m_000000_0
2014-09-09 05:28:12,986 INFO [communication thread]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt
attempt_1410195300700_18702_m_000000_0 is : 1.0
2014-09-09 05:28:13,031 INFO [uber-SubtaskRunner]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Commit go/no-go request
from attempt_1410195300700_18702_m_000000_0
2014-09-09 05:28:14,031 INFO [uber-SubtaskRunner]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Commit go/no-go request
from attempt_1410195300700_18702_m_000000_0
2014-09-09 05:28:15,032 INFO [uber-SubtaskRunner]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Commit go/no-go request
from attempt_1410195300700_18702_m_000000_0
2014-09-09 05:28:16,001 INFO [communication thread]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Status update from
attempt_1410195300700_18702_m_000000_0
2014-09-09 05:28:16,002 INFO [communication thread]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt
attempt_1410195300700_18702_m_000000_0 is : 1.0
2014-09-09 05:28:16,032 INFO [uber-SubtaskRunner]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Commit go/no-go request
from attempt_1410195300700_18702_m_000000_0
2014-09-09 05:28:17,033 INFO [uber-SubtaskRunner]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Commit go/no-go request
from attempt_1410195300700_18702_m_000000_0
2014-09-09 05:28:18,033 INFO [uber-SubtaskRunner]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Commit go/no-go request
from attempt_1410195300700_18702_m_000000_0
2014-09-09 05:28:19,019 INFO [communication thread]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Status update from
attempt_1410195300700_18702_m_000000_0
2014-09-09 05:28:19,019 INFO [communication thread]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt
attempt_1410195300700_18702_m_000000_0 is : 1.0
2014-09-09 05:28:19,034 INFO [uber-SubtaskRunner]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Commit go/no-go request
from attempt_1410195300700_18702_m_000000_0
2014-09-09 05:28:20,034 INFO [uber-SubtaskRunner]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Commit go/no-go request
from attempt_1410195300700_18702_m_000000_0
2014-09-09 05:28:21,034 INFO [uber-SubtaskRunner]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Commit go/no-go request
from attempt_1410195300700_18702_m_000000_0
2014-09-09 05:28:22,034 INFO [communication thread]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Status update from
attempt_1410195300700_18702_m_000000_0
2014-09-09 05:28:22,034 INFO [communication thread]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt
attempt_1410195300700_18702_m_000000_0 is : 1.0
2014-09-09 05:28:22,034 INFO [uber-SubtaskRunner]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Commit go/no-go request
from attempt_1410195300700_18702_m_000000_0
2014-09-09 05:28:23,035 INFO [uber-SubtaskRunner]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Commit go/no-go request
from attempt_1410195300700_18702_m_000000_0
2014-09-09 05:28:24,035 INFO [uber-SubtaskRunner]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Commit go/no-go request
from attempt_1410195300700_18702_m_000000_0
2014-09-09 05:28:25,035 INFO [uber-SubtaskRunner]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Commit go/no-go request
from attempt_1410195300700_18702_m_000000_0
2014-09-09 05:28:25,049 INFO [communication thread]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Status update from
attempt_1410195300700_18702_m_000000_0
2014-09-09 05:28:25,049 INFO [communication thread]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt
attempt_1410195300700_18702_m_000000_0 is : 1.0
2014-09-09 05:28:26,035 INFO [uber-SubtaskRunner]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Commit go/no-go request
from attempt_1410195300700_18702_m_000000_0
2014-09-09 05:28:27,036 INFO [uber-SubtaskRunner]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Commit go/no-go request
from attempt_1410195300700_18702_m_000000_0
2014-09-09 05:28:27,036 INFO [uber-SubtaskRunner]
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: Result of canCommit
for attempt_1410195300700_18702_m_000000_0:true
2014-09-09 05:28:27,036 INFO [uber-SubtaskRunner]
org.apache.hadoop.mapred.Task: Task attempt_1410195300700_18702_m_000000_0
is allowed to commit now
2014-09-09 05:28:27,088 INFO [uber-SubtaskRunner]
org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter: Saved output of
task 'attempt_1410195300700_18702_m_000000_0' to hdfs://
slc02knk.us.oracle.com:55310/tmp/thirdeye/Publish-28305282698003/_temporary/1/task_1410195300700_18702_m_000000
2014-09-09 05:28:27,104 INFO [uber-SubtaskRunner]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Status update from
attempt_1410195300700_18702_m_000000_0
2014-09-09 05:28:27,104 INFO [uber-SubtaskRunner]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt
attempt_1410195300700_18702_m_000000_0 is : 1.0
2014-09-09 05:28:27,105 INFO [uber-SubtaskRunner]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Done acknowledgement from
attempt_1410195300700_18702_m_000000_0
2014-09-09 05:28:27,105 INFO [uber-SubtaskRunner]
org.apache.hadoop.mapred.Task: Task
'attempt_1410195300700_18702_m_000000_0' done.
2014-09-09 05:28:27,107 INFO [AsyncDispatcher event handler]
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl:
attempt_1410195300700_18702_m_000000_0 TaskAttempt Transitioned from
COMMIT_PENDING to SUCCESS_CONTAINER_CLEANUP
2014-09-09 05:28:27,107 INFO [uber-SubtaskRunner]
org.apache.hadoop.mapred.LocalContainerLauncher: Processing the event
EventType: CONTAINER_REMOTE_CLEANUP for container
container_1410195300700_18702_01_000001 taskAttempt
attempt_1410195300700_18702_m_000000_0
2014-09-09 05:28:27,111 INFO [AsyncDispatcher event handler]
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl:
attempt_1410195300700_18702_m_000000_0 TaskAttempt Transitioned from
SUCCESS_CONTAINER_CLEANUP to SUCCEEDED
2014-09-09 05:28:27,124 INFO [AsyncDispatcher event handler]
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: Task succeeded with
attempt attempt_1410195300700_18702_m_000000_0
2014-09-09 05:28:27,126 INFO [AsyncDispatcher event handler]
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl:
task_1410195300700_18702_m_000000 Task Transitioned from RUNNING to
SUCCEEDED
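
If I read the 2.2 AM code correctly (this is an assumption on my part, not
something I have verified on this cluster), TaskAttemptListenerImpl.canCommit()
only answers the repeated "Commit go/no-go request" polls with true once the AM
has heard from the ResourceManager within a commit window; the check exists to
avoid double commits when the RM is unreachable. If the AM-to-RM heartbeat lags
for any reason, the task keeps polling roughly once a second until a heartbeat
lands inside the window. The two settings involved, shown here with their
default values from mapred-default.xml, are:

  <property>
    <name>yarn.app.mapreduce.am.job.committer.commit-window</name>
    <value>10000</value>
  </property>

  <property>
    <name>yarn.app.mapreduce.am.scheduler.heartbeat.interval-ms</name>
    <value>1000</value>
  </property>

A ~20 second wait would then mean the AM went well past the 10 second window
without hearing from the RM, which points back at how the small queue is being
served rather than at the job itself.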


On Tue, Sep 9, 2014 at 10:58 AM, Arun Murthy <ac...@hortonworks.com> wrote:

> Thanks for digging into this. Mind opening a jira to discuss further? Much
> appreciated.
>
> Arun
>
> On Mon, Sep 8, 2014 at 7:15 PM, Anfernee Xu <an...@gmail.com> wrote:
>
>> It turned out that it's not a configuration issue; a worker thread
>> which submits jobs to YARN was blocked, see the thread dump below
>>
>> "pool-1-thread-160" id=194 idx=0x30c tid=886 prio=5 alive, blocked,
>> native_blocked
>>     -- Blocked trying to get lock:
>> org/apache/hadoop/ipc/Client$Connection@0x1059d0c60[thin lock]
>>     at __lll_lock_wait+36(:0)@0x340260d594
>>     at tsSleep+399(threadsystem.c:83)@0x2b2356e5da80
>>     at jrockit/vm/Threads.sleep(I)V(Native Method)
>>     at jrockit/vm/Locks.waitForThinRelease(Locks.java:955)[optimized]
>>     at
>> jrockit/vm/Locks.monitorEnterSecondStageHard(Locks.java:1083)[optimized]
>>     at
>> jrockit/vm/Locks.monitorEnterSecondStage(Locks.java:1005)[optimized]
>>     at
>> org/apache/hadoop/ipc/Client$Connection.addCall(Client.java:400)[inlined]
>>     at
>> org/apache/hadoop/ipc/Client$Connection.access$2500(Client.java:314)[inlined]
>>     at
>> org/apache/hadoop/ipc/Client.getConnection(Client.java:1393)[optimized]
>>     at org/apache/hadoop/ipc/Client.call(Client.java:1318)[inlined]
>>     at org/apache/hadoop/ipc/Client.call(Client.java:1300)[inlined]
>>     at
>> org/apache/hadoop/ipc/ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)[optimized]
>>     at
>> $Proxy21.getJobReport(Lcom/google/protobuf/RpcController;Lorg/apache/hadoop/mapreduce/v2/proto/MRServiceProtos$GetJobReportRequestProto;)Lorg/apache/hadoop/mapreduce/v2/proto/MRServiceProtos$GetJobReportResponseProto;(Unknown
>> Source)
>>     at
>> org/apache/hadoop/mapreduce/v2/api/impl/pb/client/MRClientProtocolPBClientImpl.getJobReport(MRClientProtocolPBClientImpl.java:133)[optimized]
>>     at
>> sun/reflect/GeneratedMethodAccessor79.invoke(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object;(Unknown
>> Source)[optimized]
>>     at
>> sun/reflect/DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)[optimized]
>>     at java/lang/reflect/Method.invoke(Method.java:597)[inlined]
>>     at
>> org/apache/hadoop/mapred/ClientServiceDelegate.invoke(ClientServiceDelegate.java:317)[inlined]
>>     at
>> org/apache/hadoop/mapred/ClientServiceDelegate.getJobStatus(ClientServiceDelegate.java:416)[optimized]
>>     ^-- Holding lock:
>> org/apache/hadoop/mapred/ClientServiceDelegate@0x10087d788[biased lock]
>>     at
>> org/apache/hadoop/mapred/YarnRunner.getJobStatus(TIEYarnRunner.java:522)[optimized]
>>     at org/apache/hadoop/mapreduce/Job$1.run(Job.java:314)[inlined]
>>     at org/apache/hadoop/mapreduce/Job$1.run(Job.java:311)[inlined]
>>     at
>> jrockit/vm/AccessController.doPrivileged(AccessController.java:254)[inlined]
>>     at javax/security/auth/Subject.doAs(Subject.java:396)[inlined]
>>     at
>> org/apache/hadoop/security/UserGroupInformation.doAs(UserGroupInformation.java:1491)[inlined]
>>     at
>> org/apache/hadoop/mapreduce/Job.updateStatus(Job.java:311)[optimized]
>>     ^-- Holding lock: org/apache/hadoop/mapreduce/Job@0x100522fb8[biased
>> lock]
>>     at org/apache/hadoop/mapreduce/Job.isComplete(Job.java:599)
>>     at org/apache/hadoop/mapreduce/Job.waitForCompletion(Job.java:1294)
>>
>> The lock was held by
>>
>> "pool-1-thread-10" id=44 idx=0xb4 tid=736 prio=5 alive, sleeping,
>> native_waiting
>>     at pthread_cond_timedwait@@GLIBC_2.3.2+288(:0)@0x340260b1c0
>>     at eventTimedWaitNoTransitionImpl+46(event.c:93)@0x2b2356cc741f
>>     at
>> syncWaitForSignalNoTransition+133(synchronization.c:51)@0x2b2356e5a096
>>     at syncWaitForSignal+189(synchronization.c:85)@0x2b2356e5a1ae
>>     at vmtSleep+165(signaling.c:197)@0x2b2356e35ef6
>>     at JVM_Sleep+188(jvmthreads.c:119)@0x2b2356d6bb7d
>>     at java/lang/Thread.sleep(J)V(Native Method)
>>     at
>> org/apache/hadoop/ipc/Client$Connection.handleConnectionFailure(Client.java:778)[optimized]
>>     at
>> org/apache/hadoop/ipc/Client$Connection.setupConnection(Client.java:566)[optimized]
>>     ^-- Holding lock: org/apache/hadoop/ipc/Client$Connection@0x1059d0c60
>> [recursive]
>>     at
>> org/apache/hadoop/ipc/Client$Connection.setupIOstreams(Client.java:642)[optimized]
>>     ^-- Holding lock: org/apache/hadoop/ipc/Client$Connection@0x1059d0c60[thin
>> lock]
>>     at
>> org/apache/hadoop/ipc/Client$Connection.access$2600(Client.java:314)[inlined]
>>     at
>> org/apache/hadoop/ipc/Client.getConnection(Client.java:1399)[optimized]
>>     at org/apache/hadoop/ipc/Client.call(Client.java:1318)[inlined]
>>     at org/apache/hadoop/ipc/Client.call(Client.java:1300)[inlined]
>>     at
>> org/apache/hadoop/ipc/ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)[optimized]
>>     at
>> $Proxy21.getJobReport(Lcom/google/protobuf/RpcController;Lorg/apache/hadoop/mapreduce/v2/proto/MRServiceProtos$GetJobReportRequestProto;)Lorg/apache/hadoop/mapreduce/v2/proto/MRServiceProtos$GetJobReportResponseProto;(Unknown
>> Source)
>>     at
>> org/apache/hadoop/mapreduce/v2/api/impl/pb/client/MRClientProtocolPBClientImpl.getJobReport(MRClientProtocolPBClientImpl.java:133)[optimized]
>>     at
>> sun/reflect/GeneratedMethodAccessor79.invoke(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object;(Unknown
>> Source)[optimized]
>>     at
>> sun/reflect/DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)[optimized]
>>     at java/lang/reflect/Method.invoke(Method.java:597)[inlined]
>>     at
>> org/apache/hadoop/mapred/ClientServiceDelegate.invoke(ClientServiceDelegate.java:317)[inlined]
>>     at
>> org/apache/hadoop/mapred/ClientServiceDelegate.getJobStatus(ClientServiceDelegate.java:416)[optimized]
>>     ^-- Holding lock:
>> org/apache/hadoop/mapred/ClientServiceDelegate@0x1012c34f8[biased lock]
>>     at
>> org/apache/hadoop/mapred/YarnRunner.getJobStatus(TIEYarnRunner.java:522)[optimized]
>>     at org/apache/hadoop/mapreduce/Job$1.run(Job.java:314)[inlined]
>>     at org/apache/hadoop/mapreduce/Job$1.run(Job.java:311)[inlined]
>>     at
>> jrockit/vm/AccessController.doPrivileged(AccessController.java:254)[inlined]
>>     at javax/security/auth/Subject.doAs(Subject.java:396)[inlined]
>>     at
>> org/apache/hadoop/security/UserGroupInformation.doAs(UserGroupInformation.java:1491)[inlined]
>>     at
>> org/apache/hadoop/mapreduce/Job.updateStatus(Job.java:311)[optimized]
>>     ^-- Holding lock: org/apache/hadoop/mapreduce/Job@0x1016e05a8[biased
>> lock]
>>     at org/apache/hadoop/mapreduce/Job.isComplete(Job.java:599)
>>     at org/apache/hadoop/mapreduce/Job.waitForCompletion(Job.java:1294)
>>
>> You can see the thread holding the lock is in a sleep state and the calling
>> method is Connection.handleConnectionFailure(), so I checked our log
>> file and realized the connection failure is about the historyserver not
>> being available. In my case, I did not start the historyserver at all,
>> because it's not needed (I disabled log-aggregation), so my question is
>> why the job client was still trying to talk to the historyserver even
>> though log aggregation is disabled.
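
My working theory on that last point (an assumption from reading the 2.2 job
client code, not something verified here): ClientServiceDelegate falls back to
the MapReduce JobHistoryServer whenever it cannot reach the AM, for example
once the job has finished, and that fallback has nothing to do with log
aggregation; yarn.log-aggregation-enable only controls what the NodeManagers do
with container logs. With no history server running, every status poll that
misses the AM retries against mapreduce.jobhistory.address, which is what the
sleeping thread above is stuck doing. Either starting the history server
(sbin/mr-jobhistory-daemon.sh start historyserver) or pointing clients at a
host that actually runs one should clear it, e.g. in mapred-site.xml (host name
below is a placeholder):

  <property>
    <name>mapreduce.jobhistory.address</name>
    <value>historyserver-host:10020</value>
  </property>
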
>>
>> Thanks
>>
>>
>>
>> On Mon, Sep 8, 2014 at 3:57 AM, Arun Murthy <ac...@hortonworks.com> wrote:
>>
>>> How many nodes do you have in your cluster?
>>>
>>> Also, could you share the CapacityScheduler initialization logs for each
>>> queue, such as:
>>>
>>> 2014-08-14 15:14:23,835 INFO
>>> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
>>> Initialized queue: unfunded: capacity=0.5, absoluteCapacity=0.5,
>>> usedResources=<memory:0, vCores:0>, usedCapacity=0.0,
>>> absoluteUsedCapacity=0.0, numApps=0, numContainers=0
>>> 2014-08-14 15:14:23,840 INFO
>>> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue:
>>> Initializing default
>>> capacity = 0.5 [= (float) configuredCapacity / 100 ]
>>> asboluteCapacity = 0.5 [= parentAbsoluteCapacity * capacity ]
>>> maxCapacity = 1.0 [= configuredMaxCapacity ]
>>> absoluteMaxCapacity = 1.0 [= 1.0 maximumCapacity undefined,
>>> (parentAbsoluteMaxCapacity * maximumCapacity) / 100 otherwise ]
>>> userLimit = 100 [= configuredUserLimit ]
>>> userLimitFactor = 1.0 [= configuredUserLimitFactor ]
>>> maxApplications = 5000 [= configuredMaximumSystemApplicationsPerQueue or
>>> (int)(configuredMaximumSystemApplications * absoluteCapacity)]
>>> maxApplicationsPerUser = 5000 [= (int)(maxApplications * (userLimit /
>>> 100.0f) * userLimitFactor) ]
>>> maxActiveApplications = 1 [= max((int)ceil((clusterResourceMemory /
>>> minimumAllocation) * maxAMResourcePerQueuePercent * absoluteMaxCapacity),1)
>>> ]
>>> maxActiveAppsUsingAbsCap = 1 [= max((int)ceil((clusterResourceMemory /
>>> minimumAllocation) *maxAMResourcePercent * absoluteCapacity),1) ]
>>> maxActiveApplicationsPerUser = 1 [= max((int)(maxActiveApplications *
>>> (userLimit / 100.0f) * userLimitFactor),1) ]
>>> usedCapacity = 0.0 [= usedResourcesMemory / (clusterResourceMemory *
>>> absoluteCapacity)]
>>> absoluteUsedCapacity = 0.0 [= usedResourcesMemory /
>>> clusterResourceMemory]
>>> maxAMResourcePerQueuePercent = 0.1 [= configuredMaximumAMResourcePercent
>>> ]
>>> minimumAllocationFactor = 0.87506104 [= (float)(maximumAllocationMemory
>>> - minimumAllocationMemory) / maximumAllocationMemory ]
>>> numContainers = 0 [= currentNumContainers ]
>>> state = RUNNING [= configuredState ]
>>> acls = SUBMIT_APPLICATIONS: ADMINISTER_QUEUE:  [= configuredAcls ]
>>> nodeLocalityDelay = 0
>>>
>>>
>>> Then, look at values for maxActiveAppsUsingAbsCap &
>>> maxActiveApplicationsPerUser. That should help debugging.
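
To make the arithmetic concrete (all numbers below are assumed for
illustration, none are taken from this cluster): with 10 nodes of 8 GB each, a
1 GB minimum allocation, maxAMResourcePerQueuePercent = 0.1 and
absoluteMaxCapacity = 0.12, the formula above gives

  maxActiveApplications = max(ceil((80 / 1) * 0.1 * 0.12), 1) = max(ceil(0.96), 1) = 1

so only one AM, and therefore only one uberized job, can be active in the small
queue at a time. Raising the per-queue AM percentage and the user-limit factor
lifts both caps; a sketch of the relevant capacity-scheduler.xml properties
(values are examples, not recommendations):

  <property>
    <name>yarn.scheduler.capacity.root.small.maximum-am-resource-percent</name>
    <value>1.0</value>
  </property>

  <property>
    <name>yarn.scheduler.capacity.root.small.user-limit-factor</name>
    <value>10</value>
  </property>

followed by "yarn rmadmin -refreshQueues" so the running ResourceManager picks
up the change.
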
>>>
>>> thanks,
>>> Arun
>>>
>>>
>>> On Sun, Sep 7, 2014 at 9:37 AM, Anfernee Xu <an...@gmail.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> I'm running my cluster on Hadoop 2.2.0 and use the CapacityScheduler.
>>>> All my jobs are uberized and run in 2 queues; one queue takes the
>>>> majority of the capacity (90%), the other takes 10%. What I found is
>>>> that in the small queue only one job runs at any given time. I tried
>>>> tweaking the properties below, but no luck so far; could you shed some
>>>> light on this?
>>>>
>>>>  <property>
>>>>     <name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
>>>>     <value>1.0</value>
>>>>     <description>
>>>>       Maximum percent of resources in the cluster which can be used to
>>>> run
>>>>       application masters i.e. controls number of concurrent running
>>>>       applications.
>>>>     </description>
>>>>   </property>
>>>>
>>>>
>>>>  <property>
>>>>     <name>yarn.scheduler.capacity.root.queues</name>
>>>>     <value>default,small</value>
>>>>     <description>
>>>>       The queues at the this level (root is the root queue).
>>>>     </description>
>>>>   </property>
>>>>
>>>>  <property>
>>>>
>>>> <name>yarn.scheduler.capacity.root.small.maximum-am-resource-percent</name>
>>>>     <value>1.0</value>
>>>>   </property>
>>>>
>>>>
>>>>  <property>
>>>>     <name>yarn.scheduler.capacity.root.small.user-limit</name>
>>>>     <value>1</value>
>>>>   </property>
>>>>
>>>>  <property>
>>>>     <name>yarn.scheduler.capacity.root.default.capacity</name>
>>>>     <value>88</value>
>>>>     <description>Default queue target capacity.</description>
>>>>   </property>
>>>>
>>>>
>>>>   <property>
>>>>     <name>yarn.scheduler.capacity.root.small.capacity</name>
>>>>     <value>12</value>
>>>>     <description>Default queue target capacity.</description>
>>>>   </property>
>>>>
>>>>  <property>
>>>>     <name>yarn.scheduler.capacity.root.default.maximum-capacity</name>
>>>>     <value>88</value>
>>>>     <description>
>>>>       The maximum capacity of the default queue.
>>>>     </description>
>>>>   </property>
>>>>
>>>>   <property>
>>>>     <name>yarn.scheduler.capacity.root.small.maximum-capacity</name>
>>>>     <value>12</value>
>>>>     <description>Maximum queue capacity.</description>
>>>>   </property>
>>>>
>>>>
>>>> Thanks
>>>>
>>>> --
>>>> --Anfernee
>>>>
>>>
>>>
>>>
>>> --
>>>
>>> --
>>> Arun C. Murthy
>>> Hortonworks Inc.
>>> http://hortonworks.com/
>>>
>>
>>
>>
>>
>> --
>> --Anfernee
>>
>
>
>
> --
>
> --
> Arun C. Murthy
> Hortonworks Inc.
> http://hortonworks.com/
>
>



-- 
--Anfernee

Re: In Yarn how to increase the number of concurrent applications for a queue

Posted by Arun Murthy <ac...@hortonworks.com>.
Thanks for digging into this. Mind opening a jira to discuss further? Much
appreciated.

Arun

On Mon, Sep 8, 2014 at 7:15 PM, Anfernee Xu <an...@gmail.com> wrote:

> It turned out that it's not a configuration issue; a worker thread
> which submits jobs to YARN was blocked, see the thread dump below
>
> "pool-1-thread-160" id=194 idx=0x30c tid=886 prio=5 alive, blocked,
> native_blocked
>     -- Blocked trying to get lock:
> org/apache/hadoop/ipc/Client$Connection@0x1059d0c60[thin lock]
>     at __lll_lock_wait+36(:0)@0x340260d594
>     at tsSleep+399(threadsystem.c:83)@0x2b2356e5da80
>     at jrockit/vm/Threads.sleep(I)V(Native Method)
>     at jrockit/vm/Locks.waitForThinRelease(Locks.java:955)[optimized]
>     at
> jrockit/vm/Locks.monitorEnterSecondStageHard(Locks.java:1083)[optimized]
>     at jrockit/vm/Locks.monitorEnterSecondStage(Locks.java:1005)[optimized]
>     at
> org/apache/hadoop/ipc/Client$Connection.addCall(Client.java:400)[inlined]
>     at
> org/apache/hadoop/ipc/Client$Connection.access$2500(Client.java:314)[inlined]
>     at
> org/apache/hadoop/ipc/Client.getConnection(Client.java:1393)[optimized]
>     at org/apache/hadoop/ipc/Client.call(Client.java:1318)[inlined]
>     at org/apache/hadoop/ipc/Client.call(Client.java:1300)[inlined]
>     at
> org/apache/hadoop/ipc/ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)[optimized]
>     at
> $Proxy21.getJobReport(Lcom/google/protobuf/RpcController;Lorg/apache/hadoop/mapreduce/v2/proto/MRServiceProtos$GetJobReportRequestProto;)Lorg/apache/hadoop/mapreduce/v2/proto/MRServiceProtos$GetJobReportResponseProto;(Unknown
> Source)
>     at
> org/apache/hadoop/mapreduce/v2/api/impl/pb/client/MRClientProtocolPBClientImpl.getJobReport(MRClientProtocolPBClientImpl.java:133)[optimized]
>     at
> sun/reflect/GeneratedMethodAccessor79.invoke(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object;(Unknown
> Source)[optimized]
>     at
> sun/reflect/DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)[optimized]
>     at java/lang/reflect/Method.invoke(Method.java:597)[inlined]
>     at
> org/apache/hadoop/mapred/ClientServiceDelegate.invoke(ClientServiceDelegate.java:317)[inlined]
>     at
> org/apache/hadoop/mapred/ClientServiceDelegate.getJobStatus(ClientServiceDelegate.java:416)[optimized]
>     ^-- Holding lock:
> org/apache/hadoop/mapred/ClientServiceDelegate@0x10087d788[biased lock]
>     at
> org/apache/hadoop/mapred/YarnRunner.getJobStatus(TIEYarnRunner.java:522)[optimized]
>     at org/apache/hadoop/mapreduce/Job$1.run(Job.java:314)[inlined]
>     at org/apache/hadoop/mapreduce/Job$1.run(Job.java:311)[inlined]
>     at
> jrockit/vm/AccessController.doPrivileged(AccessController.java:254)[inlined]
>     at javax/security/auth/Subject.doAs(Subject.java:396)[inlined]
>     at
> org/apache/hadoop/security/UserGroupInformation.doAs(UserGroupInformation.java:1491)[inlined]
>     at
> org/apache/hadoop/mapreduce/Job.updateStatus(Job.java:311)[optimized]
>     ^-- Holding lock: org/apache/hadoop/mapreduce/Job@0x100522fb8[biased
> lock]
>     at org/apache/hadoop/mapreduce/Job.isComplete(Job.java:599)
>     at org/apache/hadoop/mapreduce/Job.waitForCompletion(Job.java:1294)
>
> The lock was held by
>
> "pool-1-thread-10" id=44 idx=0xb4 tid=736 prio=5 alive, sleeping,
> native_waiting
>     at pthread_cond_timedwait@@GLIBC_2.3.2+288(:0)@0x340260b1c0
>     at eventTimedWaitNoTransitionImpl+46(event.c:93)@0x2b2356cc741f
>     at
> syncWaitForSignalNoTransition+133(synchronization.c:51)@0x2b2356e5a096
>     at syncWaitForSignal+189(synchronization.c:85)@0x2b2356e5a1ae
>     at vmtSleep+165(signaling.c:197)@0x2b2356e35ef6
>     at JVM_Sleep+188(jvmthreads.c:119)@0x2b2356d6bb7d
>     at java/lang/Thread.sleep(J)V(Native Method)
>     at
> org/apache/hadoop/ipc/Client$Connection.handleConnectionFailure(Client.java:778)[optimized]
>     at
> org/apache/hadoop/ipc/Client$Connection.setupConnection(Client.java:566)[optimized]
>     ^-- Holding lock: org/apache/hadoop/ipc/Client$Connection@0x1059d0c60
> [recursive]
>     at
> org/apache/hadoop/ipc/Client$Connection.setupIOstreams(Client.java:642)[optimized]
>     ^-- Holding lock: org/apache/hadoop/ipc/Client$Connection@0x1059d0c60[thin
> lock]
>     at
> org/apache/hadoop/ipc/Client$Connection.access$2600(Client.java:314)[inlined]
>     at
> org/apache/hadoop/ipc/Client.getConnection(Client.java:1399)[optimized]
>     at org/apache/hadoop/ipc/Client.call(Client.java:1318)[inlined]
>     at org/apache/hadoop/ipc/Client.call(Client.java:1300)[inlined]
>     at
> org/apache/hadoop/ipc/ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)[optimized]
>     at
> $Proxy21.getJobReport(Lcom/google/protobuf/RpcController;Lorg/apache/hadoop/mapreduce/v2/proto/MRServiceProtos$GetJobReportRequestProto;)Lorg/apache/hadoop/mapreduce/v2/proto/MRServiceProtos$GetJobReportResponseProto;(Unknown
> Source)
>     at
> org/apache/hadoop/mapreduce/v2/api/impl/pb/client/MRClientProtocolPBClientImpl.getJobReport(MRClientProtocolPBClientImpl.java:133)[optimized]
>     at
> sun/reflect/GeneratedMethodAccessor79.invoke(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object;(Unknown
> Source)[optimized]
>     at
> sun/reflect/DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)[optimized]
>     at java/lang/reflect/Method.invoke(Method.java:597)[inlined]
>     at
> org/apache/hadoop/mapred/ClientServiceDelegate.invoke(ClientServiceDelegate.java:317)[inlined]
>     at
> org/apache/hadoop/mapred/ClientServiceDelegate.getJobStatus(ClientServiceDelegate.java:416)[optimized]
>     ^-- Holding lock:
> org/apache/hadoop/mapred/ClientServiceDelegate@0x1012c34f8[biased lock]
>     at
> org/apache/hadoop/mapred/YarnRunner.getJobStatus(TIEYarnRunner.java:522)[optimized]
>     at org/apache/hadoop/mapreduce/Job$1.run(Job.java:314)[inlined]
>     at org/apache/hadoop/mapreduce/Job$1.run(Job.java:311)[inlined]
>     at
> jrockit/vm/AccessController.doPrivileged(AccessController.java:254)[inlined]
>     at javax/security/auth/Subject.doAs(Subject.java:396)[inlined]
>     at
> org/apache/hadoop/security/UserGroupInformation.doAs(UserGroupInformation.java:1491)[inlined]
>     at
> org/apache/hadoop/mapreduce/Job.updateStatus(Job.java:311)[optimized]
>     ^-- Holding lock: org/apache/hadoop/mapreduce/Job@0x1016e05a8[biased
> lock]
>     at org/apache/hadoop/mapreduce/Job.isComplete(Job.java:599)
>     at org/apache/hadoop/mapreduce/Job.waitForCompletion(Job.java:1294)
>
> You can see the thread holding the lock is in a sleep state and the calling
> method is Connection.handleConnectionFailure(), so I checked our log
> file and realized the connection failure is about the historyserver not
> being available. In my case, I did not start the historyserver at all,
> because it's not needed (I disabled log-aggregation), so my question is
> why the job client was still trying to talk to the historyserver even
> though log aggregation is disabled.
>
> Thanks
>
>
>
> On Mon, Sep 8, 2014 at 3:57 AM, Arun Murthy <ac...@hortonworks.com> wrote:
>
>> How many nodes do you have in your cluster?
>>
>> Also, could you share the CapacityScheduler initialization logs for each
>> queue, such as:
>>
>> 2014-08-14 15:14:23,835 INFO
>> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
>> Initialized queue: unfunded: capacity=0.5, absoluteCapacity=0.5,
>> usedResources=<memory:0, vCores:0>, usedCapacity=0.0,
>> absoluteUsedCapacity=0.0, numApps=0, numContainers=0
>> 2014-08-14 15:14:23,840 INFO
>> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue:
>> Initializing default
>> capacity = 0.5 [= (float) configuredCapacity / 100 ]
>> asboluteCapacity = 0.5 [= parentAbsoluteCapacity * capacity ]
>> maxCapacity = 1.0 [= configuredMaxCapacity ]
>> absoluteMaxCapacity = 1.0 [= 1.0 maximumCapacity undefined,
>> (parentAbsoluteMaxCapacity * maximumCapacity) / 100 otherwise ]
>> userLimit = 100 [= configuredUserLimit ]
>> userLimitFactor = 1.0 [= configuredUserLimitFactor ]
>> maxApplications = 5000 [= configuredMaximumSystemApplicationsPerQueue or
>> (int)(configuredMaximumSystemApplications * absoluteCapacity)]
>> maxApplicationsPerUser = 5000 [= (int)(maxApplications * (userLimit /
>> 100.0f) * userLimitFactor) ]
>> maxActiveApplications = 1 [= max((int)ceil((clusterResourceMemory /
>> minimumAllocation) * maxAMResourcePerQueuePercent * absoluteMaxCapacity),1)
>> ]
>> maxActiveAppsUsingAbsCap = 1 [= max((int)ceil((clusterResourceMemory /
>> minimumAllocation) *maxAMResourcePercent * absoluteCapacity),1) ]
>> maxActiveApplicationsPerUser = 1 [= max((int)(maxActiveApplications *
>> (userLimit / 100.0f) * userLimitFactor),1) ]
>> usedCapacity = 0.0 [= usedResourcesMemory / (clusterResourceMemory *
>> absoluteCapacity)]
>> absoluteUsedCapacity = 0.0 [= usedResourcesMemory / clusterResourceMemory]
>> maxAMResourcePerQueuePercent = 0.1 [= configuredMaximumAMResourcePercent ]
>> minimumAllocationFactor = 0.87506104 [= (float)(maximumAllocationMemory -
>> minimumAllocationMemory) / maximumAllocationMemory ]
>> numContainers = 0 [= currentNumContainers ]
>> state = RUNNING [= configuredState ]
>> acls = SUBMIT_APPLICATIONS: ADMINISTER_QUEUE:  [= configuredAcls ]
>> nodeLocalityDelay = 0
>>
>>
>> Then, look at values for maxActiveAppsUsingAbsCap &
>> maxActiveApplicationsPerUser. That should help debugging.
>>
>> thanks,
>> Arun
>>
>>
>> On Sun, Sep 7, 2014 at 9:37 AM, Anfernee Xu <an...@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> I'm running my cluster on Hadoop 2.2.0 and use the CapacityScheduler.
>>> All my jobs are uberized and run in 2 queues; one queue takes the
>>> majority of the capacity (90%), the other takes 10%. What I found is
>>> that in the small queue only one job runs at any given time. I tried
>>> tweaking the properties below, but no luck so far; could you shed some
>>> light on this?
>>>
>>>  <property>
>>>     <name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
>>>     <value>1.0</value>
>>>     <description>
>>>       Maximum percent of resources in the cluster which can be used to
>>> run
>>>       application masters i.e. controls number of concurrent running
>>>       applications.
>>>     </description>
>>>   </property>
>>>
>>>
>>>  <property>
>>>     <name>yarn.scheduler.capacity.root.queues</name>
>>>     <value>default,small</value>
>>>     <description>
>>>       The queues at the this level (root is the root queue).
>>>     </description>
>>>   </property>
>>>
>>>  <property>
>>>
>>> <name>yarn.scheduler.capacity.root.small.maximum-am-resource-percent</name>
>>>     <value>1.0</value>
>>>   </property>
>>>
>>>
>>>  <property>
>>>     <name>yarn.scheduler.capacity.root.small.user-limit</name>
>>>     <value>1</value>
>>>   </property>
>>>
>>>  <property>
>>>     <name>yarn.scheduler.capacity.root.default.capacity</name>
>>>     <value>88</value>
>>>     <description>Default queue target capacity.</description>
>>>   </property>
>>>
>>>
>>>   <property>
>>>     <name>yarn.scheduler.capacity.root.small.capacity</name>
>>>     <value>12</value>
>>>     <description>Default queue target capacity.</description>
>>>   </property>
>>>
>>>  <property>
>>>     <name>yarn.scheduler.capacity.root.default.maximum-capacity</name>
>>>     <value>88</value>
>>>     <description>
>>>       The maximum capacity of the default queue.
>>>     </description>
>>>   </property>
>>>
>>>   <property>
>>>     <name>yarn.scheduler.capacity.root.small.maximum-capacity</name>
>>>     <value>12</value>
>>>     <description>Maximum queue capacity.</description>
>>>   </property>
>>>
>>>
>>> Thanks
>>>
>>> --
>>> --Anfernee
>>>
>>
>>
>>
>> --
>>
>> --
>> Arun C. Murthy
>> Hortonworks Inc.
>> http://hortonworks.com/
>>
>
>
>
>
> --
> --Anfernee
>



-- 

--
Arun C. Murthy
Hortonworks Inc.
http://hortonworks.com/


Re: In Yarn how to increase the number of concurrent applications for a queue

Posted by Arun Murthy <ac...@hortonworks.com>.
Thanks for digging into this. Mind opening a jira to discuss further? Much
appreciated.

Arun

On Mon, Sep 8, 2014 at 7:15 PM, Anfernee Xu <an...@gmail.com> wrote:

> It turned out that it's not a configuration issue, some worker thread
> which submits job to Yarn was blocked, see below thread dump
>
> "pool-1-thread-160" id=194 idx=0x30c tid=886 prio=5 alive, blocked,
> native_blocked
>     -- Blocked trying to get lock:
> org/apache/hadoop/ipc/Client$Connection@0x1059d0c60[thin lock]
>     at __lll_lock_wait+36(:0)@0x340260d594
>     at tsSleep+399(threadsystem.c:83)@0x2b2356e5da80
>     at jrockit/vm/Threads.sleep(I)V(Native Method)
>     at jrockit/vm/Locks.waitForThinRelease(Locks.java:955)[optimized]
>     at
> jrockit/vm/Locks.monitorEnterSecondStageHard(Locks.java:1083)[optimized]
>     at jrockit/vm/Locks.monitorEnterSecondStage(Locks.java:1005)[optimized]
>     at
> org/apache/hadoop/ipc/Client$Connection.addCall(Client.java:400)[inlined]
>     at
> org/apache/hadoop/ipc/Client$Connection.access$2500(Client.java:314)[inlined]
>     at
> org/apache/hadoop/ipc/Client.getConnection(Client.java:1393)[optimized]
>     at org/apache/hadoop/ipc/Client.call(Client.java:1318)[inlined]
>     at org/apache/hadoop/ipc/Client.call(Client.java:1300)[inlined]
>     at
> org/apache/hadoop/ipc/ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)[optimized]
>     at
> $Proxy21.getJobReport(Lcom/google/protobuf/RpcController;Lorg/apache/hadoop/mapreduce/v2/proto/MRServiceProtos$GetJobReportRequestProto;)Lorg/apache/hadoop/mapreduce/v2/proto/MRServiceProtos$GetJobReportResponseProto;(Unknown
> Source)
>     at
> org/apache/hadoop/mapreduce/v2/api/impl/pb/client/MRClientProtocolPBClientImpl.getJobReport(MRClientProtocolPBClientImpl.java:133)[optimized]
>     at
> sun/reflect/GeneratedMethodAccessor79.invoke(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object;(Unknown
> Source)[optimized]
>     at
> sun/reflect/DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)[optimized]
>     at java/lang/reflect/Method.invoke(Method.java:597)[inlined]
>     at
> org/apache/hadoop/mapred/ClientServiceDelegate.invoke(ClientServiceDelegate.java:317)[inlined]
>     at
> org/apache/hadoop/mapred/ClientServiceDelegate.getJobStatus(ClientServiceDelegate.java:416)[optimized]
>     ^-- Holding lock:
> org/apache/hadoop/mapred/ClientServiceDelegate@0x10087d788[biased lock]
>     at
> org/apache/hadoop/mapred/YarnRunner.getJobStatus(TIEYarnRunner.java:522)[optimized]
>     at org/apache/hadoop/mapreduce/Job$1.run(Job.java:314)[inlined]
>     at org/apache/hadoop/mapreduce/Job$1.run(Job.java:311)[inlined]
>     at
> jrockit/vm/AccessController.doPrivileged(AccessController.java:254)[inlined]
>     at javax/security/auth/Subject.doAs(Subject.java:396)[inlined]
>     at
> org/apache/hadoop/security/UserGroupInformation.doAs(UserGroupInformation.java:1491)[inlined]
>     at
> org/apache/hadoop/mapreduce/Job.updateStatus(Job.java:311)[optimized]
>     ^-- Holding lock: org/apache/hadoop/mapreduce/Job@0x100522fb8[biased
> lock]
>     at org/apache/hadoop/mapreduce/Job.isComplete(Job.java:599)
>     at org/apache/hadoop/mapreduce/Job.waitForCompletion(Job.java:1294)
>
> The lock was held by
>
> "pool-1-thread-10" id=44 idx=0xb4 tid=736 prio=5 alive, sleeping,
> native_waiting
>     at pthread_cond_timedwait@@GLIBC_2.3.2+288(:0)@0x340260b1c0
>     at eventTimedWaitNoTransitionImpl+46(event.c:93)@0x2b2356cc741f
>     at
> syncWaitForSignalNoTransition+133(synchronization.c:51)@0x2b2356e5a096
>     at syncWaitForSignal+189(synchronization.c:85)@0x2b2356e5a1ae
>     at vmtSleep+165(signaling.c:197)@0x2b2356e35ef6
>     at JVM_Sleep+188(jvmthreads.c:119)@0x2b2356d6bb7d
>     at java/lang/Thread.sleep(J)V(Native Method)
>     at
> org/apache/hadoop/ipc/Client$Connection.handleConnectionFailure(Client.java:778)[optimized]
>     at
> org/apache/hadoop/ipc/Client$Connection.setupConnection(Client.java:566)[optimized]
>     ^-- Holding lock: org/apache/hadoop/ipc/Client$Connection@0x1059d0c60
> [recursive]
>     at
> org/apache/hadoop/ipc/Client$Connection.setupIOstreams(Client.java:642)[optimized]
>     ^-- Holding lock: org/apache/hadoop/ipc/Client$Connection@0x1059d0c60[thin
> lock]
>     at
> org/apache/hadoop/ipc/Client$Connection.access$2600(Client.java:314)[inlined]
>     at
> org/apache/hadoop/ipc/Client.getConnection(Client.java:1399)[optimized]
>     at org/apache/hadoop/ipc/Client.call(Client.java:1318)[inlined]
>     at org/apache/hadoop/ipc/Client.call(Client.java:1300)[inlined]
>     at
> org/apache/hadoop/ipc/ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)[optimized]
>     at
> $Proxy21.getJobReport(Lcom/google/protobuf/RpcController;Lorg/apache/hadoop/mapreduce/v2/proto/MRServiceProtos$GetJobReportRequestProto;)Lorg/apache/hadoop/mapreduce/v2/proto/MRServiceProtos$GetJobReportResponseProto;(Unknown
> Source)
>     at
> org/apache/hadoop/mapreduce/v2/api/impl/pb/client/MRClientProtocolPBClientImpl.getJobReport(MRClientProtocolPBClientImpl.java:133)[optimized]
>     at
> sun/reflect/GeneratedMethodAccessor79.invoke(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object;(Unknown
> Source)[optimized]
>     at
> sun/reflect/DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)[optimized]
>     at java/lang/reflect/Method.invoke(Method.java:597)[inlined]
>     at
> org/apache/hadoop/mapred/ClientServiceDelegate.invoke(ClientServiceDelegate.java:317)[inlined]
>     at
> org/apache/hadoop/mapred/ClientServiceDelegate.getJobStatus(ClientServiceDelegate.java:416)[optimized]
>     ^-- Holding lock:
> org/apache/hadoop/mapred/ClientServiceDelegate@0x1012c34f8[biased lock]
>     at
> org/apache/hadoop/mapred/YarnRunner.getJobStatus(TIEYarnRunner.java:522)[optimized]
>     at org/apache/hadoop/mapreduce/Job$1.run(Job.java:314)[inlined]
>     at org/apache/hadoop/mapreduce/Job$1.run(Job.java:311)[inlined]
>     at
> jrockit/vm/AccessController.doPrivileged(AccessController.java:254)[inlined]
>     at javax/security/auth/Subject.doAs(Subject.java:396)[inlined]
>     at
> org/apache/hadoop/security/UserGroupInformation.doAs(UserGroupInformation.java:1491)[inlined]
>     at
> org/apache/hadoop/mapreduce/Job.updateStatus(Job.java:311)[optimized]
>     ^-- Holding lock: org/apache/hadoop/mapreduce/Job@0x1016e05a8[biased
> lock]
>     at org/apache/hadoop/mapreduce/Job.isComplete(Job.java:599)
>     at org/apache/hadoop/mapreduce/Job.waitForCompletion(Job.java:1294)
>
> You can see the thread holding the lock is in a sleep state and the calling
> method is Connection.handleConnectionFailure(), so I checked our log
> file and realized the connection failure is because the historyserver is not
> available. In my case, I did not start the historyserver at all, because it's
> not needed (I disabled log aggregation), so my question is why the job
> client was still trying to talk to the historyserver even though log
> aggregation is disabled.
>
> Thanks
>
>
>
> On Mon, Sep 8, 2014 at 3:57 AM, Arun Murthy <ac...@hortonworks.com> wrote:
>
>> How many nodes do you have in your cluster?
>>
>> Also, could you share the CapacityScheduler initialization logs for each
>> queue, such as:
>>
>> 2014-08-14 15:14:23,835 INFO
>> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
>> Initialized queue: unfunded: capacity=0.5, absoluteCapacity=0.5,
>> usedResources=<memory:0, vCores:0>, usedCapacity=0.0,
>> absoluteUsedCapacity=0.0, numApps=0, numContainers=0
>> 2014-08-14 15:14:23,840 INFO
>> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue:
>> Initializing default
>> capacity = 0.5 [= (float) configuredCapacity / 100 ]
>> asboluteCapacity = 0.5 [= parentAbsoluteCapacity * capacity ]
>> maxCapacity = 1.0 [= configuredMaxCapacity ]
>> absoluteMaxCapacity = 1.0 [= 1.0 maximumCapacity undefined,
>> (parentAbsoluteMaxCapacity * maximumCapacity) / 100 otherwise ]
>> userLimit = 100 [= configuredUserLimit ]
>> userLimitFactor = 1.0 [= configuredUserLimitFactor ]
>> maxApplications = 5000 [= configuredMaximumSystemApplicationsPerQueue or
>> (int)(configuredMaximumSystemApplications * absoluteCapacity)]
>> maxApplicationsPerUser = 5000 [= (int)(maxApplications * (userLimit /
>> 100.0f) * userLimitFactor) ]
>> maxActiveApplications = 1 [= max((int)ceil((clusterResourceMemory /
>> minimumAllocation) * maxAMResourcePerQueuePercent * absoluteMaxCapacity),1)
>> ]
>> maxActiveAppsUsingAbsCap = 1 [= max((int)ceil((clusterResourceMemory /
>> minimumAllocation) *maxAMResourcePercent * absoluteCapacity),1) ]
>> maxActiveApplicationsPerUser = 1 [= max((int)(maxActiveApplications *
>> (userLimit / 100.0f) * userLimitFactor),1) ]
>> usedCapacity = 0.0 [= usedResourcesMemory / (clusterResourceMemory *
>> absoluteCapacity)]
>> absoluteUsedCapacity = 0.0 [= usedResourcesMemory / clusterResourceMemory]
>> maxAMResourcePerQueuePercent = 0.1 [= configuredMaximumAMResourcePercent ]
>> minimumAllocationFactor = 0.87506104 [= (float)(maximumAllocationMemory -
>> minimumAllocationMemory) / maximumAllocationMemory ]
>> numContainers = 0 [= currentNumContainers ]
>> state = RUNNING [= configuredState ]
>> acls = SUBMIT_APPLICATIONS: ADMINISTER_QUEUE:  [= configuredAcls ]
>> nodeLocalityDelay = 0
>>
>>
>> Then, look at values for maxActiveAppsUsingAbsCap &
>> maxActiveApplicationsPerUser. That should help debugging.
>>
>> thanks,
>> Arun
>>
>>
>> On Sun, Sep 7, 2014 at 9:37 AM, Anfernee Xu <an...@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> I'm running my cluster on Hadoop 2.2.0 and use the CapacityScheduler. All
>>> my jobs are uberized and run across 2 queues: one queue takes the
>>> majority of the capacity (90%), the other takes 10%. What I found is that
>>> for the small queue, only one job runs at any given time. I tried tweaking
>>> the properties below, but no luck so far; could you guys shed some light on this?
>>>
>>>  <property>
>>>     <name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
>>>     <value>1.0</value>
>>>     <description>
>>>       Maximum percent of resources in the cluster which can be used to
>>> run
>>>       application masters i.e. controls number of concurrent running
>>>       applications.
>>>     </description>
>>>   </property>
>>>
>>>
>>>  <property>
>>>     <name>yarn.scheduler.capacity.root.queues</name>
>>>     <value>default,small</value>
>>>     <description>
>>>       The queues at the this level (root is the root queue).
>>>     </description>
>>>   </property>
>>>
>>>  <property>
>>>
>>> <name>yarn.scheduler.capacity.root.small.maximum-am-resource-percent</name>
>>>     <value>1.0</value>
>>>   </property>
>>>
>>>
>>>  <property>
>>>     <name>yarn.scheduler.capacity.root.small.user-limit</name>
>>>     <value>1</value>
>>>   </property>
>>>
>>>  <property>
>>>     <name>yarn.scheduler.capacity.root.default.capacity</name>
>>>     <value>88</value>
>>>     <description>Default queue target capacity.</description>
>>>   </property>
>>>
>>>
>>>   <property>
>>>     <name>yarn.scheduler.capacity.root.small.capacity</name>
>>>     <value>12</value>
>>>     <description>Default queue target capacity.</description>
>>>   </property>
>>>
>>>  <property>
>>>     <name>yarn.scheduler.capacity.root.default.maximum-capacity</name>
>>>     <value>88</value>
>>>     <description>
>>>       The maximum capacity of the default queue.
>>>     </description>
>>>   </property>
>>>
>>>   <property>
>>>     <name>yarn.scheduler.capacity.root.small.maximum-capacity</name>
>>>     <value>12</value>
>>>     <description>Maximum queue capacity.</description>
>>>   </property>
>>>
>>>
>>> Thanks
>>>
>>> --
>>> --Anfernee
>>>
>>
>>
>>
>> --
>>
>> --
>> Arun C. Murthy
>> Hortonworks Inc.
>> http://hortonworks.com/
>>
>> CONFIDENTIALITY NOTICE
>> NOTICE: This message is intended for the use of the individual or entity
>> to which it is addressed and may contain information that is confidential,
>> privileged and exempt from disclosure under applicable law. If the reader
>> of this message is not the intended recipient, you are hereby notified that
>> any printing, copying, dissemination, distribution, disclosure or
>> forwarding of this communication is strictly prohibited. If you have
>> received this communication in error, please contact the sender immediately
>> and delete it from your system. Thank You.
>
>
>
>
> --
> --Anfernee
>



-- 

--
Arun C. Murthy
Hortonworks Inc.
http://hortonworks.com/

-- 
CONFIDENTIALITY NOTICE
NOTICE: This message is intended for the use of the individual or entity to 
which it is addressed and may contain information that is confidential, 
privileged and exempt from disclosure under applicable law. If the reader 
of this message is not the intended recipient, you are hereby notified that 
any printing, copying, dissemination, distribution, disclosure or 
forwarding of this communication is strictly prohibited. If you have 
received this communication in error, please contact the sender immediately 
and delete it from your system. Thank You.
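
To make the LeafQueue formulas quoted above concrete, here is a minimal
sketch of the same arithmetic (the cluster size, minimum allocation and
queue shares are illustrative assumptions, not values taken from this
thread). It shows why a queue holding roughly 12% of the cluster with an
effective maximum-am-resource-percent of 0.1 ends up with
maxActiveApplications = 1, i.e. a single concurrent AM:

// Illustrative sketch of the limits printed in the LeafQueue init log above.
// All input numbers are assumptions made for the sake of the example.
public class LeafQueueLimitSketch {
    public static void main(String[] args) {
        int clusterMemoryMb = 10 * 8192;    // assume 10 NodeManagers x 8 GB
        int minimumAllocationMb = 1024;     // yarn.scheduler.minimum-allocation-mb
        float maxAmResourcePercent = 0.1f;  // maxAMResourcePerQueuePercent in the log
        float absoluteCapacity = 0.12f;     // "small" queue: 12% of root
        float absoluteMaxCapacity = 0.12f;  // maximum-capacity also set to 12%

        int schedulableContainers = clusterMemoryMb / minimumAllocationMb; // 80

        int maxActiveApplications = Math.max(
            (int) Math.ceil(schedulableContainers * maxAmResourcePercent * absoluteMaxCapacity), 1);
        int maxActiveAppsUsingAbsCap = Math.max(
            (int) Math.ceil(schedulableContainers * maxAmResourcePercent * absoluteCapacity), 1);

        // ceil(80 * 0.1 * 0.12) = ceil(0.96) = 1 -> only one AM may be active at a time.
        System.out.println("maxActiveApplications    = " + maxActiveApplications);
        System.out.println("maxActiveAppsUsingAbsCap = " + maxActiveAppsUsingAbsCap);
    }
}

The example log above shows maxAMResourcePerQueuePercent = 0.1, the
cluster-wide default; if the per-queue override of 1.0 from the posted
capacity-scheduler.xml takes effect, the same arithmetic gives
ceil(80 * 1.0 * 0.12) = 10, so comparing this value in the small queue's own
initialization log is a reasonable first check.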

Re: In Yarn how to increase the number of concurrent applications for a queue

Posted by Arun Murthy <ac...@hortonworks.com>.
Thanks for digging into this. Mind opening a jira to discuss further? Much
appreciated.

Arun

On Mon, Sep 8, 2014 at 7:15 PM, Anfernee Xu <an...@gmail.com> wrote:

> It turned out that it's not a configuration issue; a worker thread
> which submits jobs to Yarn was blocked, see the thread dump below
>
> "pool-1-thread-160" id=194 idx=0x30c tid=886 prio=5 alive, blocked,
> native_blocked
>     -- Blocked trying to get lock:
> org/apache/hadoop/ipc/Client$Connection@0x1059d0c60[thin lock]
>     at __lll_lock_wait+36(:0)@0x340260d594
>     at tsSleep+399(threadsystem.c:83)@0x2b2356e5da80
>     at jrockit/vm/Threads.sleep(I)V(Native Method)
>     at jrockit/vm/Locks.waitForThinRelease(Locks.java:955)[optimized]
>     at
> jrockit/vm/Locks.monitorEnterSecondStageHard(Locks.java:1083)[optimized]
>     at jrockit/vm/Locks.monitorEnterSecondStage(Locks.java:1005)[optimized]
>     at
> org/apache/hadoop/ipc/Client$Connection.addCall(Client.java:400)[inlined]
>     at
> org/apache/hadoop/ipc/Client$Connection.access$2500(Client.java:314)[inlined]
>     at
> org/apache/hadoop/ipc/Client.getConnection(Client.java:1393)[optimized]
>     at org/apache/hadoop/ipc/Client.call(Client.java:1318)[inlined]
>     at org/apache/hadoop/ipc/Client.call(Client.java:1300)[inlined]
>     at
> org/apache/hadoop/ipc/ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)[optimized]
>     at
> $Proxy21.getJobReport(Lcom/google/protobuf/RpcController;Lorg/apache/hadoop/mapreduce/v2/proto/MRServiceProtos$GetJobReportRequestProto;)Lorg/apache/hadoop/mapreduce/v2/proto/MRServiceProtos$GetJobReportResponseProto;(Unknown
> Source)
>     at
> org/apache/hadoop/mapreduce/v2/api/impl/pb/client/MRClientProtocolPBClientImpl.getJobReport(MRClientProtocolPBClientImpl.java:133)[optimized]
>     at
> sun/reflect/GeneratedMethodAccessor79.invoke(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object;(Unknown
> Source)[optimized]
>     at
> sun/reflect/DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)[optimized]
>     at java/lang/reflect/Method.invoke(Method.java:597)[inlined]
>     at
> org/apache/hadoop/mapred/ClientServiceDelegate.invoke(ClientServiceDelegate.java:317)[inlined]
>     at
> org/apache/hadoop/mapred/ClientServiceDelegate.getJobStatus(ClientServiceDelegate.java:416)[optimized]
>     ^-- Holding lock:
> org/apache/hadoop/mapred/ClientServiceDelegate@0x10087d788[biased lock]
>     at
> org/apache/hadoop/mapred/YarnRunner.getJobStatus(TIEYarnRunner.java:522)[optimized]
>     at org/apache/hadoop/mapreduce/Job$1.run(Job.java:314)[inlined]
>     at org/apache/hadoop/mapreduce/Job$1.run(Job.java:311)[inlined]
>     at
> jrockit/vm/AccessController.doPrivileged(AccessController.java:254)[inlined]
>     at javax/security/auth/Subject.doAs(Subject.java:396)[inlined]
>     at
> org/apache/hadoop/security/UserGroupInformation.doAs(UserGroupInformation.java:1491)[inlined]
>     at
> org/apache/hadoop/mapreduce/Job.updateStatus(Job.java:311)[optimized]
>     ^-- Holding lock: org/apache/hadoop/mapreduce/Job@0x100522fb8[biased
> lock]
>     at org/apache/hadoop/mapreduce/Job.isComplete(Job.java:599)
>     at org/apache/hadoop/mapreduce/Job.waitForCompletion(Job.java:1294)
>
> The lock was held by
>
> "pool-1-thread-10" id=44 idx=0xb4 tid=736 prio=5 alive, sleeping,
> native_waiting
>     at pthread_cond_timedwait@@GLIBC_2.3.2+288(:0)@0x340260b1c0
>     at eventTimedWaitNoTransitionImpl+46(event.c:93)@0x2b2356cc741f
>     at
> syncWaitForSignalNoTransition+133(synchronization.c:51)@0x2b2356e5a096
>     at syncWaitForSignal+189(synchronization.c:85)@0x2b2356e5a1ae
>     at vmtSleep+165(signaling.c:197)@0x2b2356e35ef6
>     at JVM_Sleep+188(jvmthreads.c:119)@0x2b2356d6bb7d
>     at java/lang/Thread.sleep(J)V(Native Method)
>     at
> org/apache/hadoop/ipc/Client$Connection.handleConnectionFailure(Client.java:778)[optimized]
>     at
> org/apache/hadoop/ipc/Client$Connection.setupConnection(Client.java:566)[optimized]
>     ^-- Holding lock: org/apache/hadoop/ipc/Client$Connection@0x1059d0c60
> [recursive]
>     at
> org/apache/hadoop/ipc/Client$Connection.setupIOstreams(Client.java:642)[optimized]
>     ^-- Holding lock: org/apache/hadoop/ipc/Client$Connection@0x1059d0c60[thin
> lock]
>     at
> org/apache/hadoop/ipc/Client$Connection.access$2600(Client.java:314)[inlined]
>     at
> org/apache/hadoop/ipc/Client.getConnection(Client.java:1399)[optimized]
>     at org/apache/hadoop/ipc/Client.call(Client.java:1318)[inlined]
>     at org/apache/hadoop/ipc/Client.call(Client.java:1300)[inlined]
>     at
> org/apache/hadoop/ipc/ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)[optimized]
>     at
> $Proxy21.getJobReport(Lcom/google/protobuf/RpcController;Lorg/apache/hadoop/mapreduce/v2/proto/MRServiceProtos$GetJobReportRequestProto;)Lorg/apache/hadoop/mapreduce/v2/proto/MRServiceProtos$GetJobReportResponseProto;(Unknown
> Source)
>     at
> org/apache/hadoop/mapreduce/v2/api/impl/pb/client/MRClientProtocolPBClientImpl.getJobReport(MRClientProtocolPBClientImpl.java:133)[optimized]
>     at
> sun/reflect/GeneratedMethodAccessor79.invoke(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object;(Unknown
> Source)[optimized]
>     at
> sun/reflect/DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)[optimized]
>     at java/lang/reflect/Method.invoke(Method.java:597)[inlined]
>     at
> org/apache/hadoop/mapred/ClientServiceDelegate.invoke(ClientServiceDelegate.java:317)[inlined]
>     at
> org/apache/hadoop/mapred/ClientServiceDelegate.getJobStatus(ClientServiceDelegate.java:416)[optimized]
>     ^-- Holding lock:
> org/apache/hadoop/mapred/ClientServiceDelegate@0x1012c34f8[biased lock]
>     at
> org/apache/hadoop/mapred/YarnRunner.getJobStatus(TIEYarnRunner.java:522)[optimized]
>     at org/apache/hadoop/mapreduce/Job$1.run(Job.java:314)[inlined]
>     at org/apache/hadoop/mapreduce/Job$1.run(Job.java:311)[inlined]
>     at
> jrockit/vm/AccessController.doPrivileged(AccessController.java:254)[inlined]
>     at javax/security/auth/Subject.doAs(Subject.java:396)[inlined]
>     at
> org/apache/hadoop/security/UserGroupInformation.doAs(UserGroupInformation.java:1491)[inlined]
>     at
> org/apache/hadoop/mapreduce/Job.updateStatus(Job.java:311)[optimized]
>     ^-- Holding lock: org/apache/hadoop/mapreduce/Job@0x1016e05a8[biased
> lock]
>     at org/apache/hadoop/mapreduce/Job.isComplete(Job.java:599)
>     at org/apache/hadoop/mapreduce/Job.waitForCompletion(Job.java:1294)
>
> You can see the thread holding the lock is in a sleep state and the calling
> method is Connection.handleConnectionFailure(), so I checked our log
> file and realized the connection failure is because the historyserver is not
> available. In my case, I did not start the historyserver at all, because it's
> not needed (I disabled log aggregation), so my question is why the job
> client was still trying to talk to the historyserver even though log
> aggregation is disabled.
>
> Thanks
>
>
>
> On Mon, Sep 8, 2014 at 3:57 AM, Arun Murthy <ac...@hortonworks.com> wrote:
>
>> How many nodes do you have in your cluster?
>>
>> Also, could you share the CapacityScheduler initialization logs for each
>> queue, such as:
>>
>> 2014-08-14 15:14:23,835 INFO
>> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
>> Initialized queue: unfunded: capacity=0.5, absoluteCapacity=0.5,
>> usedResources=<memory:0, vCores:0>, usedCapacity=0.0,
>> absoluteUsedCapacity=0.0, numApps=0, numContainers=0
>> 2014-08-14 15:14:23,840 INFO
>> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue:
>> Initializing default
>> capacity = 0.5 [= (float) configuredCapacity / 100 ]
>> asboluteCapacity = 0.5 [= parentAbsoluteCapacity * capacity ]
>> maxCapacity = 1.0 [= configuredMaxCapacity ]
>> absoluteMaxCapacity = 1.0 [= 1.0 maximumCapacity undefined,
>> (parentAbsoluteMaxCapacity * maximumCapacity) / 100 otherwise ]
>> userLimit = 100 [= configuredUserLimit ]
>> userLimitFactor = 1.0 [= configuredUserLimitFactor ]
>> maxApplications = 5000 [= configuredMaximumSystemApplicationsPerQueue or
>> (int)(configuredMaximumSystemApplications * absoluteCapacity)]
>> maxApplicationsPerUser = 5000 [= (int)(maxApplications * (userLimit /
>> 100.0f) * userLimitFactor) ]
>> maxActiveApplications = 1 [= max((int)ceil((clusterResourceMemory /
>> minimumAllocation) * maxAMResourcePerQueuePercent * absoluteMaxCapacity),1)
>> ]
>> maxActiveAppsUsingAbsCap = 1 [= max((int)ceil((clusterResourceMemory /
>> minimumAllocation) *maxAMResourcePercent * absoluteCapacity),1) ]
>> maxActiveApplicationsPerUser = 1 [= max((int)(maxActiveApplications *
>> (userLimit / 100.0f) * userLimitFactor),1) ]
>> usedCapacity = 0.0 [= usedResourcesMemory / (clusterResourceMemory *
>> absoluteCapacity)]
>> absoluteUsedCapacity = 0.0 [= usedResourcesMemory / clusterResourceMemory]
>> maxAMResourcePerQueuePercent = 0.1 [= configuredMaximumAMResourcePercent ]
>> minimumAllocationFactor = 0.87506104 [= (float)(maximumAllocationMemory -
>> minimumAllocationMemory) / maximumAllocationMemory ]
>> numContainers = 0 [= currentNumContainers ]
>> state = RUNNING [= configuredState ]
>> acls = SUBMIT_APPLICATIONS: ADMINISTER_QUEUE:  [= configuredAcls ]
>> nodeLocalityDelay = 0
>>
>>
>> Then, look at values for maxActiveAppsUsingAbsCap &
>> maxActiveApplicationsPerUser. That should help debugging.
>>
>> thanks,
>> Arun
>>
>>
>> On Sun, Sep 7, 2014 at 9:37 AM, Anfernee Xu <an...@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> I'm running my cluster on Hadoop 2.2.0 and use the CapacityScheduler. All
>>> my jobs are uberized and run across 2 queues: one queue takes the
>>> majority of the capacity (90%), the other takes 10%. What I found is that
>>> for the small queue, only one job runs at any given time. I tried tweaking
>>> the properties below, but no luck so far; could you guys shed some light on this?
>>>
>>>  <property>
>>>     <name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
>>>     <value>1.0</value>
>>>     <description>
>>>       Maximum percent of resources in the cluster which can be used to
>>> run
>>>       application masters i.e. controls number of concurrent running
>>>       applications.
>>>     </description>
>>>   </property>
>>>
>>>
>>>  <property>
>>>     <name>yarn.scheduler.capacity.root.queues</name>
>>>     <value>default,small</value>
>>>     <description>
>>>       The queues at the this level (root is the root queue).
>>>     </description>
>>>   </property>
>>>
>>>  <property>
>>>
>>> <name>yarn.scheduler.capacity.root.small.maximum-am-resource-percent</name>
>>>     <value>1.0</value>
>>>   </property>
>>>
>>>
>>>  <property>
>>>     <name>yarn.scheduler.capacity.root.small.user-limit</name>
>>>     <value>1</value>
>>>   </property>
>>>
>>>  <property>
>>>     <name>yarn.scheduler.capacity.root.default.capacity</name>
>>>     <value>88</value>
>>>     <description>Default queue target capacity.</description>
>>>   </property>
>>>
>>>
>>>   <property>
>>>     <name>yarn.scheduler.capacity.root.small.capacity</name>
>>>     <value>12</value>
>>>     <description>Default queue target capacity.</description>
>>>   </property>
>>>
>>>  <property>
>>>     <name>yarn.scheduler.capacity.root.default.maximum-capacity</name>
>>>     <value>88</value>
>>>     <description>
>>>       The maximum capacity of the default queue.
>>>     </description>
>>>   </property>
>>>
>>>   <property>
>>>     <name>yarn.scheduler.capacity.root.small.maximum-capacity</name>
>>>     <value>12</value>
>>>     <description>Maximum queue capacity.</description>
>>>   </property>
>>>
>>>
>>> Thanks
>>>
>>> --
>>> --Anfernee
>>>
>>
>>
>>
>> --
>>
>> --
>> Arun C. Murthy
>> Hortonworks Inc.
>> http://hortonworks.com/
>>
>> CONFIDENTIALITY NOTICE
>> NOTICE: This message is intended for the use of the individual or entity
>> to which it is addressed and may contain information that is confidential,
>> privileged and exempt from disclosure under applicable law. If the reader
>> of this message is not the intended recipient, you are hereby notified that
>> any printing, copying, dissemination, distribution, disclosure or
>> forwarding of this communication is strictly prohibited. If you have
>> received this communication in error, please contact the sender immediately
>> and delete it from your system. Thank You.
>
>
>
>
> --
> --Anfernee
>



-- 

--
Arun C. Murthy
Hortonworks Inc.
http://hortonworks.com/

-- 
CONFIDENTIALITY NOTICE
NOTICE: This message is intended for the use of the individual or entity to 
which it is addressed and may contain information that is confidential, 
privileged and exempt from disclosure under applicable law. If the reader 
of this message is not the intended recipient, you are hereby notified that 
any printing, copying, dissemination, distribution, disclosure or 
forwarding of this communication is strictly prohibited. If you have 
received this communication in error, please contact the sender immediately 
and delete it from your system. Thank You.
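
One plausible explanation for the history server connections seen in the
dumps above (a reading of the client behaviour, not something confirmed in
this thread): the fallback has nothing to do with log aggregation. Once the
AM exits, the MapReduce client's ClientServiceDelegate redirects
getJobReport() to the job history server, so a client polling
Job.isComplete() on a finished job will keep retrying
mapreduce.jobhistory.address whether or not log aggregation is enabled. A
minimal way to see which address the client will use (the "jhs-host"
fallback below is a hypothetical placeholder, not a real host):

import org.apache.hadoop.conf.Configuration;

// Minimal sketch: print the history server address the MR client falls back
// to for completed jobs. The stock default in mapred-default.xml is
// 0.0.0.0:10020; "jhs-host:10020" is only a hypothetical fallback string.
public class HistoryServerAddressCheck {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        System.out.println("mapreduce.jobhistory.address = "
                + conf.get("mapreduce.jobhistory.address", "jhs-host:10020"));
    }
}

If the history server is otherwise unneeded, simply starting one
(sbin/mr-jobhistory-daemon.sh start historyserver) is probably the easiest
way to stop the client-side connect retries.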

Re: In Yarn how to increase the number of concurrent applications for a queue

Posted by Anfernee Xu <an...@gmail.com>.
It turned out that it's not a configuration issue; a worker thread which
submits jobs to Yarn was blocked, see the thread dump below

"pool-1-thread-160" id=194 idx=0x30c tid=886 prio=5 alive, blocked,
native_blocked
    -- Blocked trying to get lock:
org/apache/hadoop/ipc/Client$Connection@0x1059d0c60[thin lock]
    at __lll_lock_wait+36(:0)@0x340260d594
    at tsSleep+399(threadsystem.c:83)@0x2b2356e5da80
    at jrockit/vm/Threads.sleep(I)V(Native Method)
    at jrockit/vm/Locks.waitForThinRelease(Locks.java:955)[optimized]
    at
jrockit/vm/Locks.monitorEnterSecondStageHard(Locks.java:1083)[optimized]
    at jrockit/vm/Locks.monitorEnterSecondStage(Locks.java:1005)[optimized]
    at
org/apache/hadoop/ipc/Client$Connection.addCall(Client.java:400)[inlined]
    at
org/apache/hadoop/ipc/Client$Connection.access$2500(Client.java:314)[inlined]
    at
org/apache/hadoop/ipc/Client.getConnection(Client.java:1393)[optimized]
    at org/apache/hadoop/ipc/Client.call(Client.java:1318)[inlined]
    at org/apache/hadoop/ipc/Client.call(Client.java:1300)[inlined]
    at
org/apache/hadoop/ipc/ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)[optimized]
    at
$Proxy21.getJobReport(Lcom/google/protobuf/RpcController;Lorg/apache/hadoop/mapreduce/v2/proto/MRServiceProtos$GetJobReportRequestProto;)Lorg/apache/hadoop/mapreduce/v2/proto/MRServiceProtos$GetJobReportResponseProto;(Unknown
Source)
    at
org/apache/hadoop/mapreduce/v2/api/impl/pb/client/MRClientProtocolPBClientImpl.getJobReport(MRClientProtocolPBClientImpl.java:133)[optimized]
    at
sun/reflect/GeneratedMethodAccessor79.invoke(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object;(Unknown
Source)[optimized]
    at
sun/reflect/DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)[optimized]
    at java/lang/reflect/Method.invoke(Method.java:597)[inlined]
    at
org/apache/hadoop/mapred/ClientServiceDelegate.invoke(ClientServiceDelegate.java:317)[inlined]
    at
org/apache/hadoop/mapred/ClientServiceDelegate.getJobStatus(ClientServiceDelegate.java:416)[optimized]
    ^-- Holding lock:
org/apache/hadoop/mapred/ClientServiceDelegate@0x10087d788[biased lock]
    at
org/apache/hadoop/mapred/YarnRunner.getJobStatus(TIEYarnRunner.java:522)[optimized]
    at org/apache/hadoop/mapreduce/Job$1.run(Job.java:314)[inlined]
    at org/apache/hadoop/mapreduce/Job$1.run(Job.java:311)[inlined]
    at
jrockit/vm/AccessController.doPrivileged(AccessController.java:254)[inlined]
    at javax/security/auth/Subject.doAs(Subject.java:396)[inlined]
    at
org/apache/hadoop/security/UserGroupInformation.doAs(UserGroupInformation.java:1491)[inlined]
    at org/apache/hadoop/mapreduce/Job.updateStatus(Job.java:311)[optimized]
    ^-- Holding lock: org/apache/hadoop/mapreduce/Job@0x100522fb8[biased
lock]
    at org/apache/hadoop/mapreduce/Job.isComplete(Job.java:599)
    at org/apache/hadoop/mapreduce/Job.waitForCompletion(Job.java:1294)

The lock was held by

"pool-1-thread-10" id=44 idx=0xb4 tid=736 prio=5 alive, sleeping,
native_waiting
    at pthread_cond_timedwait@@GLIBC_2.3.2+288(:0)@0x340260b1c0
    at eventTimedWaitNoTransitionImpl+46(event.c:93)@0x2b2356cc741f
    at
syncWaitForSignalNoTransition+133(synchronization.c:51)@0x2b2356e5a096
    at syncWaitForSignal+189(synchronization.c:85)@0x2b2356e5a1ae
    at vmtSleep+165(signaling.c:197)@0x2b2356e35ef6
    at JVM_Sleep+188(jvmthreads.c:119)@0x2b2356d6bb7d
    at java/lang/Thread.sleep(J)V(Native Method)
    at
org/apache/hadoop/ipc/Client$Connection.handleConnectionFailure(Client.java:778)[optimized]
    at
org/apache/hadoop/ipc/Client$Connection.setupConnection(Client.java:566)[optimized]
    ^-- Holding lock: org/apache/hadoop/ipc/Client$Connection@0x1059d0c60
[recursive]
    at
org/apache/hadoop/ipc/Client$Connection.setupIOstreams(Client.java:642)[optimized]
    ^-- Holding lock: org/apache/hadoop/ipc/Client$Connection@0x1059d0c60[thin
lock]
    at
org/apache/hadoop/ipc/Client$Connection.access$2600(Client.java:314)[inlined]
    at
org/apache/hadoop/ipc/Client.getConnection(Client.java:1399)[optimized]
    at org/apache/hadoop/ipc/Client.call(Client.java:1318)[inlined]
    at org/apache/hadoop/ipc/Client.call(Client.java:1300)[inlined]
    at
org/apache/hadoop/ipc/ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)[optimized]
    at
$Proxy21.getJobReport(Lcom/google/protobuf/RpcController;Lorg/apache/hadoop/mapreduce/v2/proto/MRServiceProtos$GetJobReportRequestProto;)Lorg/apache/hadoop/mapreduce/v2/proto/MRServiceProtos$GetJobReportResponseProto;(Unknown
Source)
    at
org/apache/hadoop/mapreduce/v2/api/impl/pb/client/MRClientProtocolPBClientImpl.getJobReport(MRClientProtocolPBClientImpl.java:133)[optimized]
    at
sun/reflect/GeneratedMethodAccessor79.invoke(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object;(Unknown
Source)[optimized]
    at
sun/reflect/DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)[optimized]
    at java/lang/reflect/Method.invoke(Method.java:597)[inlined]
    at
org/apache/hadoop/mapred/ClientServiceDelegate.invoke(ClientServiceDelegate.java:317)[inlined]
    at
org/apache/hadoop/mapred/ClientServiceDelegate.getJobStatus(ClientServiceDelegate.java:416)[optimized]
    ^-- Holding lock:
org/apache/hadoop/mapred/ClientServiceDelegate@0x1012c34f8[biased lock]
    at
org/apache/hadoop/mapred/YarnRunner.getJobStatus(TIEYarnRunner.java:522)[optimized]
    at org/apache/hadoop/mapreduce/Job$1.run(Job.java:314)[inlined]
    at org/apache/hadoop/mapreduce/Job$1.run(Job.java:311)[inlined]
    at
jrockit/vm/AccessController.doPrivileged(AccessController.java:254)[inlined]
    at javax/security/auth/Subject.doAs(Subject.java:396)[inlined]
    at
org/apache/hadoop/security/UserGroupInformation.doAs(UserGroupInformation.java:1491)[inlined]
    at org/apache/hadoop/mapreduce/Job.updateStatus(Job.java:311)[optimized]
    ^-- Holding lock: org/apache/hadoop/mapreduce/Job@0x1016e05a8[biased
lock]
    at org/apache/hadoop/mapreduce/Job.isComplete(Job.java:599)
    at org/apache/hadoop/mapreduce/Job.waitForCompletion(Job.java:1294)

You can see the thread holding the lock is in a sleep state and the calling
method is Connection.handleConnectionFailure(), so I checked our log
file and realized the connection failure is because the historyserver is not
available. In my case, I did not start the historyserver at all, because it's
not needed (I disabled log aggregation), so my question is why the job
client was still trying to talk to the historyserver even though log
aggregation is disabled.

Thanks



On Mon, Sep 8, 2014 at 3:57 AM, Arun Murthy <ac...@hortonworks.com> wrote:

> How many nodes do you have in your cluster?
>
> Also, could you share the CapacityScheduler initialization logs for each
> queue, such as:
>
> 2014-08-14 15:14:23,835 INFO
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
> Initialized queue: unfunded: capacity=0.5, absoluteCapacity=0.5,
> usedResources=<memory:0, vCores:0>, usedCapacity=0.0,
> absoluteUsedCapacity=0.0, numApps=0, numContainers=0
> 2014-08-14 15:14:23,840 INFO
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue:
> Initializing default
> capacity = 0.5 [= (float) configuredCapacity / 100 ]
> asboluteCapacity = 0.5 [= parentAbsoluteCapacity * capacity ]
> maxCapacity = 1.0 [= configuredMaxCapacity ]
> absoluteMaxCapacity = 1.0 [= 1.0 maximumCapacity undefined,
> (parentAbsoluteMaxCapacity * maximumCapacity) / 100 otherwise ]
> userLimit = 100 [= configuredUserLimit ]
> userLimitFactor = 1.0 [= configuredUserLimitFactor ]
> maxApplications = 5000 [= configuredMaximumSystemApplicationsPerQueue or
> (int)(configuredMaximumSystemApplications * absoluteCapacity)]
> maxApplicationsPerUser = 5000 [= (int)(maxApplications * (userLimit /
> 100.0f) * userLimitFactor) ]
> maxActiveApplications = 1 [= max((int)ceil((clusterResourceMemory /
> minimumAllocation) * maxAMResourcePerQueuePercent * absoluteMaxCapacity),1)
> ]
> maxActiveAppsUsingAbsCap = 1 [= max((int)ceil((clusterResourceMemory /
> minimumAllocation) *maxAMResourcePercent * absoluteCapacity),1) ]
> maxActiveApplicationsPerUser = 1 [= max((int)(maxActiveApplications *
> (userLimit / 100.0f) * userLimitFactor),1) ]
> usedCapacity = 0.0 [= usedResourcesMemory / (clusterResourceMemory *
> absoluteCapacity)]
> absoluteUsedCapacity = 0.0 [= usedResourcesMemory / clusterResourceMemory]
> maxAMResourcePerQueuePercent = 0.1 [= configuredMaximumAMResourcePercent ]
> minimumAllocationFactor = 0.87506104 [= (float)(maximumAllocationMemory -
> minimumAllocationMemory) / maximumAllocationMemory ]
> numContainers = 0 [= currentNumContainers ]
> state = RUNNING [= configuredState ]
> acls = SUBMIT_APPLICATIONS: ADMINISTER_QUEUE:  [= configuredAcls ]
> nodeLocalityDelay = 0
>
>
> Then, look at values for maxActiveAppsUsingAbsCap &
> maxActiveApplicationsPerUser. That should help debugging.
>
> thanks,
> Arun
>
>
> On Sun, Sep 7, 2014 at 9:37 AM, Anfernee Xu <an...@gmail.com> wrote:
>
>> Hi,
>>
>> I'm running my cluster on Hadoop 2.2.0 and use the CapacityScheduler. All
>> my jobs are uberized and run across 2 queues: one queue takes the
>> majority of the capacity (90%), the other takes 10%. What I found is that
>> for the small queue, only one job runs at any given time. I tried tweaking
>> the properties below, but no luck so far; could you guys shed some light on this?
>>
>>  <property>
>>     <name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
>>     <value>1.0</value>
>>     <description>
>>       Maximum percent of resources in the cluster which can be used to run
>>       application masters i.e. controls number of concurrent running
>>       applications.
>>     </description>
>>   </property>
>>
>>
>>  <property>
>>     <name>yarn.scheduler.capacity.root.queues</name>
>>     <value>default,small</value>
>>     <description>
>>       The queues at the this level (root is the root queue).
>>     </description>
>>   </property>
>>
>>  <property>
>>
>> <name>yarn.scheduler.capacity.root.small.maximum-am-resource-percent</name>
>>     <value>1.0</value>
>>   </property>
>>
>>
>>  <property>
>>     <name>yarn.scheduler.capacity.root.small.user-limit</name>
>>     <value>1</value>
>>   </property>
>>
>>  <property>
>>     <name>yarn.scheduler.capacity.root.default.capacity</name>
>>     <value>88</value>
>>     <description>Default queue target capacity.</description>
>>   </property>
>>
>>
>>   <property>
>>     <name>yarn.scheduler.capacity.root.small.capacity</name>
>>     <value>12</value>
>>     <description>Default queue target capacity.</description>
>>   </property>
>>
>>  <property>
>>     <name>yarn.scheduler.capacity.root.default.maximum-capacity</name>
>>     <value>88</value>
>>     <description>
>>       The maximum capacity of the default queue.
>>     </description>
>>   </property>
>>
>>   <property>
>>     <name>yarn.scheduler.capacity.root.small.maximum-capacity</name>
>>     <value>12</value>
>>     <description>Maximum queue capacity.</description>
>>   </property>
>>
>>
>> Thanks
>>
>> --
>> --Anfernee
>>
>
>
>
> --
>
> --
> Arun C. Murthy
> Hortonworks Inc.
> http://hortonworks.com/
>
> CONFIDENTIALITY NOTICE
> NOTICE: This message is intended for the use of the individual or entity
> to which it is addressed and may contain information that is confidential,
> privileged and exempt from disclosure under applicable law. If the reader
> of this message is not the intended recipient, you are hereby notified that
> any printing, copying, dissemination, distribution, disclosure or
> forwarding of this communication is strictly prohibited. If you have
> received this communication in error, please contact the sender immediately
> and delete it from your system. Thank You.




-- 
--Anfernee
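
A client-side mitigation sketch for the blocking itself (assuming the stock
ipc.client.* keys from core-default.xml apply to the submitting JVM; the
values below are illustrative): the shared org.apache.hadoop.ipc.Client
connection holds its lock across the connect-retry sleep in
handleConnectionFailure(), so every other submitter thread in the same JVM
queues up behind one unreachable endpoint. Cutting the connect retries
shortens how long that lock is held:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// Sketch only: shorten how long an unreachable endpoint (e.g. a missing
// history server) can keep the shared IPC connection lock busy in a client
// JVM that polls many jobs at once. The defaults are 10 and 45 respectively.
public class JobClientRetryTuning {
    public static Job newJob(Configuration base, String name) throws Exception {
        Configuration conf = new Configuration(base);
        conf.setInt("ipc.client.connect.max.retries", 2);
        conf.setInt("ipc.client.connect.max.retries.on.timeouts", 2);
        return Job.getInstance(conf, name);
    }
}

Whether this is sufficient depends on which retry policy the MRClientProtocol
proxy ends up using, so treat it as a starting point rather than a fix.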

Re: In Yarn how to increase the number of concurrent applications for a queue

Posted by Arun Murthy <ac...@hortonworks.com>.
How many nodes do you have in your cluster?

Also, could you share the CapacityScheduler initialization logs for each
queue, such as:

2014-08-14 15:14:23,835 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
Initialized queue: unfunded: capacity=0.5, absoluteCapacity=0.5,
usedResources=<memory:0, vCores:0>, usedCapacity=0.0,
absoluteUsedCapacity=0.0, numApps=0, numContainers=0
2014-08-14 15:14:23,840 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue:
Initializing default
capacity = 0.5 [= (float) configuredCapacity / 100 ]
asboluteCapacity = 0.5 [= parentAbsoluteCapacity * capacity ]
maxCapacity = 1.0 [= configuredMaxCapacity ]
absoluteMaxCapacity = 1.0 [= 1.0 maximumCapacity undefined,
(parentAbsoluteMaxCapacity * maximumCapacity) / 100 otherwise ]
userLimit = 100 [= configuredUserLimit ]
userLimitFactor = 1.0 [= configuredUserLimitFactor ]
maxApplications = 5000 [= configuredMaximumSystemApplicationsPerQueue or
(int)(configuredMaximumSystemApplications * absoluteCapacity)]
maxApplicationsPerUser = 5000 [= (int)(maxApplications * (userLimit /
100.0f) * userLimitFactor) ]
maxActiveApplications = 1 [= max((int)ceil((clusterResourceMemory /
minimumAllocation) * maxAMResourcePerQueuePercent * absoluteMaxCapacity),1)
]
maxActiveAppsUsingAbsCap = 1 [= max((int)ceil((clusterResourceMemory /
minimumAllocation) *maxAMResourcePercent * absoluteCapacity),1) ]
maxActiveApplicationsPerUser = 1 [= max((int)(maxActiveApplications *
(userLimit / 100.0f) * userLimitFactor),1) ]
usedCapacity = 0.0 [= usedResourcesMemory / (clusterResourceMemory *
absoluteCapacity)]
absoluteUsedCapacity = 0.0 [= usedResourcesMemory / clusterResourceMemory]
maxAMResourcePerQueuePercent = 0.1 [= configuredMaximumAMResourcePercent ]
minimumAllocationFactor = 0.87506104 [= (float)(maximumAllocationMemory -
minimumAllocationMemory) / maximumAllocationMemory ]
numContainers = 0 [= currentNumContainers ]
state = RUNNING [= configuredState ]
acls = SUBMIT_APPLICATIONS: ADMINISTER_QUEUE:  [= configuredAcls ]
nodeLocalityDelay = 0


Then, look at values for maxActiveAppsUsingAbsCap &
maxActiveApplicationsPerUser. That should help debugging.
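
As a worked illustration with assumed numbers (not the actual cluster figures):
with 100 GB of schedulable cluster memory, a 1 GB minimum allocation,
maxAMResourcePerQueuePercent = 0.1 and absoluteCapacity = 0.1 for the small
queue, the formula above gives

  maxActiveAppsUsingAbsCap = max(ceil((100 / 1) * 0.1 * 0.1), 1) = max(1, 1) = 1

i.e. only one concurrent application, whereas raising the AM resource percent to
1.0 on the same cluster would give max(ceil(100 * 1.0 * 0.1), 1) = 10.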

thanks,
Arun


On Sun, Sep 7, 2014 at 9:37 AM, Anfernee Xu <an...@gmail.com> wrote:

> Hi,
>
> I'rm running my cluster at Hadoop 2.2.0,  and use CapacityScheduler. And
> all my jobs are uberized and running among 2 queues, one queue takes
> majority of capacity(90%), another take 10%. What I found is for small
> queue, only one job is running for a given time, I tried twisting below
> properties, but no luck so far, could you guys share some light on this?
>
>  <property>
>     <name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
>     <value>1.0</value>
>     <description>
>       Maximum percent of resources in the cluster which can be used to run
>       application masters i.e. controls number of concurrent running
>       applications.
>     </description>
>   </property>
>
>
>  <property>
>     <name>yarn.scheduler.capacity.root.queues</name>
>     <value>default,small</value>
>     <description>
>       The queues at the this level (root is the root queue).
>     </description>
>   </property>
>
>  <property>
>
> <name>yarn.scheduler.capacity.root.small.maximum-am-resource-percent</name>
>     <value>1.0</value>
>   </property>
>
>
>  <property>
>     <name>yarn.scheduler.capacity.root.small.user-limit</name>
>     <value>1</value>
>   </property>
>
>  <property>
>     <name>yarn.scheduler.capacity.root.default.capacity</name>
>     <value>88</value>
>     <description>Default queue target capacity.</description>
>   </property>
>
>
>   <property>
>     <name>yarn.scheduler.capacity.root.small.capacity</name>
>     <value>12</value>
>     <description>Default queue target capacity.</description>
>   </property>
>
>  <property>
>     <name>yarn.scheduler.capacity.root.default.maximum-capacity</name>
>     <value>88</value>
>     <description>
>       The maximum capacity of the default queue.
>     </description>
>   </property>
>
>   <property>
>     <name>yarn.scheduler.capacity.root.small.maximum-capacity</name>
>     <value>12</value>
>     <description>Maximum queue capacity.</description>
>   </property>
>
>
> Thanks
>
> --
> --Anfernee
>



-- 

--
Arun C. Murthy
Hortonworks Inc.
http://hortonworks.com/

-- 
CONFIDENTIALITY NOTICE
NOTICE: This message is intended for the use of the individual or entity to 
which it is addressed and may contain information that is confidential, 
privileged and exempt from disclosure under applicable law. If the reader 
of this message is not the intended recipient, you are hereby notified that 
any printing, copying, dissemination, distribution, disclosure or 
forwarding of this communication is strictly prohibited. If you have 
received this communication in error, please contact the sender immediately 
and delete it from your system. Thank You.
