Posted to mapreduce-user@hadoop.apache.org by Krishna Kishore Bonagiri <wr...@gmail.com> on 2013/09/12 15:15:25 UTC

Container allocation fails randomly

Hi,
  I am using 2.1.0-beta and have seen container allocation fail randomly,
even when running the same application in a loop. I know that the cluster
has enough resources to give, because it allocated resources for the same
application on all the other iterations of the loop and ran it successfully.

   Whenever such a failure happens, I see a lot of messages like the
following in the node manager's log. Any clues as to why this happens?

2013-09-12 08:54:36,204 INFO
org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending
out status for container: container_id { app_attempt_id { application_id {
id: 2 cluster_timestamp: 1378990400253 } attemptId: 1 } id: 1 } state:
C_RUNNING diagnostics: "" exit_status: -1000
2013-09-12 08:54:37,220 INFO
org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending
out status for container: container_id { app_attempt_id { application_id {
id: 2 cluster_timestamp: 1378990400253 } attemptId: 1 } id: 1 } state:
C_RUNNING diagnostics: "" exit_status: -1000
2013-09-12 08:54:38,231 INFO
org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending
out status for container: container_id { app_attempt_id { application_id {
id: 2 cluster_timestamp: 1378990400253 } attemptId: 1 } id: 1 } state:
C_RUNNING diagnostics: "" exit_status: -1000
2013-09-12 08:54:39,239 INFO
org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending
out status for container: container_id { app_attempt_id { application_id {
id: 2 cluster_timestamp: 1378990400253 } attemptId: 1 } id: 1 } state:
C_RUNNING diagnostics: "" exit_status: -1000
2013-09-12 08:54:40,267 INFO
org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending
out status for container: container_id { app_attempt_id { application_id {
id: 2 cluster_timestamp: 1378990400253 } attemptId: 1 } id: 1 } state:
C_RUNNING diagnostics: "" exit_status: -1000
2013-09-12 08:54:41,275 INFO
org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending
out status for container: container_id { app_attempt_id { application_id {
id: 2 cluster_timestamp: 1378990400253 } attemptId: 1 } id: 1 } state:
C_RUNNING diagnostics: "" exit_status: -1000
2013-09-12 08:54:42,283 INFO
org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending
out status for container: container_id { app_attempt_id { application_id {
id: 2 cluster_timestamp: 1378990400253 } attemptId: 1 } id: 1 } state:
C_RUNNING diagnostics: "" exit_status: -1000
2013-09-12 08:54:43,289 INFO
org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending
out status for container: container_id { app_attempt_id { application_id {
id: 2 cluster_timestamp: 1378990400253 } attemptId: 1 } id: 1 } state:
C_RUNNING diagnostics: "" exit_status: -1000


Thanks,
Kishore
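
For context, exit_status -1000 appears to correspond to
ContainerExitStatus.INVALID, i.e. the exit code is simply not set while the
container is still in state C_RUNNING, so these status reports by themselves
do not indicate a failure. Below is a minimal sketch (hypothetical class and
method names, not code from this thread) of how an AM might interpret the
exit codes it does receive for completed containers:

import java.util.List;
import org.apache.hadoop.yarn.api.records.ContainerExitStatus;
import org.apache.hadoop.yarn.api.records.ContainerStatus;

public final class ExitStatusLogger {
  // Intended to be called from AMRMClientAsync.CallbackHandler's
  // onContainersCompleted(); logs how each reported container exited.
  public static void logCompleted(List<ContainerStatus> statuses) {
    for (ContainerStatus s : statuses) {
      int exit = s.getExitStatus();
      if (exit == ContainerExitStatus.SUCCESS) {          // 0
        System.out.println(s.getContainerId() + " completed successfully");
      } else if (exit == ContainerExitStatus.ABORTED) {   // -100, released/preempted
        System.out.println(s.getContainerId() + " was aborted by the RM");
      } else if (exit == ContainerExitStatus.INVALID) {   // -1000, exit code not set yet
        System.out.println(s.getContainerId() + " has no exit code yet");
      } else {
        System.out.println(s.getContainerId() + " failed, exit=" + exit
            + ", diagnostics=" + s.getDiagnostics());
      }
    }
  }
}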

Re: Container allocation fails randomly

Posted by Krishna Kishore Bonagiri <wr...@gmail.com>.
Hi Omkar,

  It is my own custom AM that I am using, not the MR-AM. I still find it
hard to believe that a negative value could come out of the getProgress()
call, which is always computed as a division of positive numbers, but it
might be a floating-point computation problem as you are saying.

Thanks,
Kishore
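
A minimal defensive sketch of such a getProgress() (hypothetical field
names, assuming the counters are kept in AtomicIntegers; not the code from
this thread): it guards against a zero denominator and clamps the result
into [0.0, 1.0], so whatever the counters momentarily hold, the value handed
to the AMRMClient heartbeat can never fail the non-negativity check.

import java.util.concurrent.atomic.AtomicInteger;

// Fragment of an AMRMClientAsync.CallbackHandler implementation.
private final AtomicInteger completedContainers = new AtomicInteger(0);
private final AtomicInteger requestedContainers = new AtomicInteger(0);

@Override
public float getProgress() {
  int total = requestedContainers.get();
  if (total <= 0) {
    return 0.0f;                               // nothing requested yet
  }
  float progress = (float) completedContainers.get() / total;
  // Clamp to the [0, 1] range the RM expects, whatever the counters hold.
  return Math.max(0.0f, Math.min(1.0f, progress));
}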



On Thu, Sep 19, 2013 at 5:52 AM, Omkar Joshi <oj...@hortonworks.com> wrote:

> This is clearly an AM bug. are you using MR-AM or custom AM? you should
> check AM code which is computing progress. I suspect there must be some
> float computation problems. If it is an MR-AM problem then please file a
> map reduce bug.
>
> Thanks,
> Omkar Joshi
> *Hortonworks Inc.* <http://www.hortonworks.com>
>
>
> On Tue, Sep 17, 2013 at 2:47 AM, Krishna Kishore Bonagiri <
> write2kishore@gmail.com> wrote:
>
>> Hi Omkar,
>>
>>   Thanks for the quick reply, and sorry for not being able to get the
>> required logs that you have asked for.
>>
>>   But in the meanwhile I just wanted to check if you can get a clue with
>> the information I have now. I am seeing the following kind of error message
>> in AppMaster.stderr whenever this failure is happening. I don't know why
>> does it happen, the getProgress() call that I have implemented
>> in RMCallbackHandler could never return a negative value! Doesn't this
>> error mean that this getProgress() is giving a -ve value?
>>
>> Exception in thread "AMRM Heartbeater thread"
>> java.lang.IllegalArgumentException: Progress indicator should not be
>> negative
>>         at
>> com.google.common.base.Preconditions.checkArgument(Preconditions.java:88)
>>         at
>> org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.allocate(AMRMClientImpl.java:199)
>>         at
>> org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$HeartbeatThread.run(AMRMClientAsyncImpl.java:224)
>>
>> Thanks,
>> Kishore
>>
>>
>> On Fri, Sep 13, 2013 at 2:59 AM, Omkar Joshi <oj...@hortonworks.com>wrote:
>>
>>> Can you give more information? logs (complete) will help a lot around
>>> this time frame. Are the containers getting assigned via scheduler? is it
>>> failing when node manager tries to start container? I clearly see the
>>> diagnostic message is empty but do you see anything in NM logs? Also if
>>> there were running containers on the machine before launching new ones..
>>> then are they killed? or they are still hanging around? can you also try
>>> applying patch "https://issues.apache.org/jira/browse/YARN-1053" ? and
>>> check if you can see any message?
>>>
>>> Thanks,
>>> Omkar Joshi
>>> *Hortonworks Inc.* <http://www.hortonworks.com>
>>>
>>>
>>> On Thu, Sep 12, 2013 at 6:15 AM, Krishna Kishore Bonagiri <
>>> write2kishore@gmail.com> wrote:
>>>
>>>> Hi,
>>>>   I am using 2.1.0-beta and have seen container allocation failing
>>>> randomly even when running the same application in a loop. I know that the
>>>> cluster has enough resources to give, because it gave the resources for the
>>>> same application all the other times in the loop and ran it successfully.
>>>>
>>>>    I have observed a lot of the following kind of messages in the node
>>>> manager's log whenever such failure happens, any clues as to why it happens?
>>>>
>>>> 2013-09-12 08:54:36,204 INFO
>>>> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending
>>>> out status for container: container_id { app_attempt_id { application_id {
>>>> id: 2 cluster_timestamp: 1378990400253 } attemptId: 1 } id: 1 } state:
>>>> C_RUNNING diagnostics: "" exit_status: -1000
>>>> 2013-09-12 08:54:37,220 INFO
>>>> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending
>>>> out status for container: container_id { app_attempt_id { application_id {
>>>> id: 2 cluster_timestamp: 1378990400253 } attemptId: 1 } id: 1 } state:
>>>> C_RUNNING diagnostics: "" exit_status: -1000
>>>> 2013-09-12 08:54:38,231 INFO
>>>> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending
>>>> out status for container: container_id { app_attempt_id { application_id {
>>>> id: 2 cluster_timestamp: 1378990400253 } attemptId: 1 } id: 1 } state:
>>>> C_RUNNING diagnostics: "" exit_status: -1000
>>>> 2013-09-12 08:54:39,239 INFO
>>>> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending
>>>> out status for container: container_id { app_attempt_id { application_id {
>>>> id: 2 cluster_timestamp: 1378990400253 } attemptId: 1 } id: 1 } state:
>>>> C_RUNNING diagnostics: "" exit_status: -1000
>>>> 2013-09-12 08:54:40,267 INFO
>>>> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending
>>>> out status for container: container_id { app_attempt_id { application_id {
>>>> id: 2 cluster_timestamp: 1378990400253 } attemptId: 1 } id: 1 } state:
>>>> C_RUNNING diagnostics: "" exit_status: -1000
>>>> 2013-09-12 08:54:41,275 INFO
>>>> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending
>>>> out status for container: container_id { app_attempt_id { application_id {
>>>> id: 2 cluster_timestamp: 1378990400253 } attemptId: 1 } id: 1 } state:
>>>> C_RUNNING diagnostics: "" exit_status: -1000
>>>> 2013-09-12 08:54:42,283 INFO
>>>> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending
>>>> out status for container: container_id { app_attempt_id { application_id {
>>>> id: 2 cluster_timestamp: 1378990400253 } attemptId: 1 } id: 1 } state:
>>>> C_RUNNING diagnostics: "" exit_status: -1000
>>>> 2013-09-12 08:54:43,289 INFO
>>>> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending
>>>> out status for container: container_id { app_attempt_id { application_id {
>>>> id: 2 cluster_timestamp: 1378990400253 } attemptId: 1 } id: 1 } state:
>>>> C_RUNNING diagnostics: "" exit_status: -1000
>>>>
>>>>
>>>> Thanks,
>>>> Kishore
>>>>
>>>
>>>
>>
>>
>>
>
>

Re: Container allocation fails randomly

Posted by Omkar Joshi <oj...@hortonworks.com>.
This is clearly an AM bug. Are you using the MR-AM or a custom AM? You
should check the AM code that computes progress; I suspect there is some
float computation problem. If it is an MR-AM problem, then please file a
MapReduce bug.

Thanks,
Omkar Joshi
*Hortonworks Inc.* <http://www.hortonworks.com>
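
One concrete float computation problem that can trip the non-negativity
check in AMRMClientImpl.allocate() (quoted below) even though only
non-negative numbers are divided is a transiently zero denominator: under
float arithmetic 0/0 is NaN, and NaN fails a "progress >= 0" style
comparison, so the "should not be negative" message can appear even though
no counter was ever negative. A hypothetical standalone demo (assuming the
precondition is a plain >= 0 check, which the exception message suggests but
which is not shown in this thread):

public class ProgressNaNDemo {
  public static void main(String[] args) {
    int completed = 0;
    int total = 0;              // e.g. read before any container was requested
    float progress = (float) completed / total;                 // 0/0 -> NaN
    System.out.println("progress = " + progress);               // prints NaN
    System.out.println("progress >= 0 ? " + (progress >= 0));   // prints false
  }
}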


On Tue, Sep 17, 2013 at 2:47 AM, Krishna Kishore Bonagiri <
write2kishore@gmail.com> wrote:

> Hi Omkar,
>
>   Thanks for the quick reply, and sorry for not being able to get the
> required logs that you have asked for.
>
>   But in the meanwhile I just wanted to check if you can get a clue with
> the information I have now. I am seeing the following kind of error message
> in AppMaster.stderr whenever this failure is happening. I don't know why
> does it happen, the getProgress() call that I have implemented
> in RMCallbackHandler could never return a negative value! Doesn't this
> error mean that this getProgress() is giving a -ve value?
>
> Exception in thread "AMRM Heartbeater thread"
> java.lang.IllegalArgumentException: Progress indicator should not be
> negative
>         at
> com.google.common.base.Preconditions.checkArgument(Preconditions.java:88)
>         at
> org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.allocate(AMRMClientImpl.java:199)
>         at
> org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$HeartbeatThread.run(AMRMClientAsyncImpl.java:224)
>
> Thanks,
> Kishore
>
>
> On Fri, Sep 13, 2013 at 2:59 AM, Omkar Joshi <oj...@hortonworks.com>wrote:
>
>> Can you give more information? logs (complete) will help a lot around
>> this time frame. Are the containers getting assigned via scheduler? is it
>> failing when node manager tries to start container? I clearly see the
>> diagnostic message is empty but do you see anything in NM logs? Also if
>> there were running containers on the machine before launching new ones..
>> then are they killed? or they are still hanging around? can you also try
>> applying patch "https://issues.apache.org/jira/browse/YARN-1053" ? and
>> check if you can see any message?
>>
>> Thanks,
>> Omkar Joshi
>> *Hortonworks Inc.* <http://www.hortonworks.com>
>>
>>
>> On Thu, Sep 12, 2013 at 6:15 AM, Krishna Kishore Bonagiri <
>> write2kishore@gmail.com> wrote:
>>
>>> Hi,
>>>   I am using 2.1.0-beta and have seen container allocation failing
>>> randomly even when running the same application in a loop. I know that the
>>> cluster has enough resources to give, because it gave the resources for the
>>> same application all the other times in the loop and ran it successfully.
>>>
>>>    I have observed a lot of the following kind of messages in the node
>>> manager's log whenever such failure happens, any clues as to why it happens?
>>>
>>> 2013-09-12 08:54:36,204 INFO
>>> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending
>>> out status for container: container_id { app_attempt_id { application_id {
>>> id: 2 cluster_timestamp: 1378990400253 } attemptId: 1 } id: 1 } state:
>>> C_RUNNING diagnostics: "" exit_status: -1000
>>> 2013-09-12 08:54:37,220 INFO
>>> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending
>>> out status for container: container_id { app_attempt_id { application_id {
>>> id: 2 cluster_timestamp: 1378990400253 } attemptId: 1 } id: 1 } state:
>>> C_RUNNING diagnostics: "" exit_status: -1000
>>> 2013-09-12 08:54:38,231 INFO
>>> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending
>>> out status for container: container_id { app_attempt_id { application_id {
>>> id: 2 cluster_timestamp: 1378990400253 } attemptId: 1 } id: 1 } state:
>>> C_RUNNING diagnostics: "" exit_status: -1000
>>> 2013-09-12 08:54:39,239 INFO
>>> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending
>>> out status for container: container_id { app_attempt_id { application_id {
>>> id: 2 cluster_timestamp: 1378990400253 } attemptId: 1 } id: 1 } state:
>>> C_RUNNING diagnostics: "" exit_status: -1000
>>> 2013-09-12 08:54:40,267 INFO
>>> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending
>>> out status for container: container_id { app_attempt_id { application_id {
>>> id: 2 cluster_timestamp: 1378990400253 } attemptId: 1 } id: 1 } state:
>>> C_RUNNING diagnostics: "" exit_status: -1000
>>> 2013-09-12 08:54:41,275 INFO
>>> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending
>>> out status for container: container_id { app_attempt_id { application_id {
>>> id: 2 cluster_timestamp: 1378990400253 } attemptId: 1 } id: 1 } state:
>>> C_RUNNING diagnostics: "" exit_status: -1000
>>> 2013-09-12 08:54:42,283 INFO
>>> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending
>>> out status for container: container_id { app_attempt_id { application_id {
>>> id: 2 cluster_timestamp: 1378990400253 } attemptId: 1 } id: 1 } state:
>>> C_RUNNING diagnostics: "" exit_status: -1000
>>> 2013-09-12 08:54:43,289 INFO
>>> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending
>>> out status for container: container_id { app_attempt_id { application_id {
>>> id: 2 cluster_timestamp: 1378990400253 } attemptId: 1 } id: 1 } state:
>>> C_RUNNING diagnostics: "" exit_status: -1000
>>>
>>>
>>> Thanks,
>>> Kishore
>>>
>>
>>
>
>
>

-- 
CONFIDENTIALITY NOTICE
NOTICE: This message is intended for the use of the individual or entity to 
which it is addressed and may contain information that is confidential, 
privileged and exempt from disclosure under applicable law. If the reader 
of this message is not the intended recipient, you are hereby notified that 
any printing, copying, dissemination, distribution, disclosure or 
forwarding of this communication is strictly prohibited. If you have 
received this communication in error, please contact the sender immediately 
and delete it from your system. Thank You.

Re: Container allocation fails randomly

Posted by Omkar Joshi <oj...@hortonworks.com>.
This is clearly an AM bug. are you using MR-AM or custom AM? you should
check AM code which is computing progress. I suspect there must be some
float computation problems. If it is an MR-AM problem then please file a
map reduce bug.

Thanks,
Omkar Joshi
*Hortonworks Inc.* <http://www.hortonworks.com>


On Tue, Sep 17, 2013 at 2:47 AM, Krishna Kishore Bonagiri <
write2kishore@gmail.com> wrote:

> Hi Omkar,
>
>   Thanks for the quick reply, and sorry for not being able to get the
> required logs that you have asked for.
>
>   But in the meanwhile I just wanted to check if you can get a clue with
> the information I have now. I am seeing the following kind of error message
> in AppMaster.stderr whenever this failure is happening. I don't know why
> does it happen, the getProgress() call that I have implemented
> in RMCallbackHandler could never return a negative value! Doesn't this
> error mean that this getProgress() is giving a -ve value?
>
> Exception in thread "AMRM Heartbeater thread"
> java.lang.IllegalArgumentException: Progress indicator should not be
> negative
>         at
> com.google.common.base.Preconditions.checkArgument(Preconditions.java:88)
>         at
> org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.allocate(AMRMClientImpl.java:199)
>         at
> org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$HeartbeatThread.run(AMRMClientAsyncImpl.java:224)
>
> Thanks,
> Kishore
>
>
> On Fri, Sep 13, 2013 at 2:59 AM, Omkar Joshi <oj...@hortonworks.com>wrote:
>
>> Can you give more information? logs (complete) will help a lot around
>> this time frame. Are the containers getting assigned via scheduler? is it
>> failing when node manager tries to start container? I clearly see the
>> diagnostic message is empty but do you see anything in NM logs? Also if
>> there were running containers on the machine before launching new ones..
>> then are they killed? or they are still hanging around? can you also try
>> applying patch "https://issues.apache.org/jira/browse/YARN-1053" ? and
>> check if you can see any message?
>>
>> Thanks,
>> Omkar Joshi
>> *Hortonworks Inc.* <http://www.hortonworks.com>
>>
>>
>> On Thu, Sep 12, 2013 at 6:15 AM, Krishna Kishore Bonagiri <
>> write2kishore@gmail.com> wrote:
>>
>>> Hi,
>>>   I am using 2.1.0-beta and have seen container allocation failing
>>> randomly even when running the same application in a loop. I know that the
>>> cluster has enough resources to give, because it gave the resources for the
>>> same application all the other times in the loop and ran it successfully.
>>>
>>>    I have observed a lot of the following kind of messages in the node
>>> manager's log whenever such failure happens, any clues as to why it happens?
>>>
>>> 2013-09-12 08:54:36,204 INFO
>>> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending
>>> out status for container: container_id { app_attempt_id { application_id {
>>> id: 2 cluster_timestamp: 1378990400253 } attemptId: 1 } id: 1 } state:
>>> C_RUNNING diagnostics: "" exit_status: -1000
>>> 2013-09-12 08:54:37,220 INFO
>>> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending
>>> out status for container: container_id { app_attempt_id { application_id {
>>> id: 2 cluster_timestamp: 1378990400253 } attemptId: 1 } id: 1 } state:
>>> C_RUNNING diagnostics: "" exit_status: -1000
>>> 2013-09-12 08:54:38,231 INFO
>>> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending
>>> out status for container: container_id { app_attempt_id { application_id {
>>> id: 2 cluster_timestamp: 1378990400253 } attemptId: 1 } id: 1 } state:
>>> C_RUNNING diagnostics: "" exit_status: -1000
>>> 2013-09-12 08:54:39,239 INFO
>>> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending
>>> out status for container: container_id { app_attempt_id { application_id {
>>> id: 2 cluster_timestamp: 1378990400253 } attemptId: 1 } id: 1 } state:
>>> C_RUNNING diagnostics: "" exit_status: -1000
>>> 2013-09-12 08:54:40,267 INFO
>>> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending
>>> out status for container: container_id { app_attempt_id { application_id {
>>> id: 2 cluster_timestamp: 1378990400253 } attemptId: 1 } id: 1 } state:
>>> C_RUNNING diagnostics: "" exit_status: -1000
>>> 2013-09-12 08:54:41,275 INFO
>>> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending
>>> out status for container: container_id { app_attempt_id { application_id {
>>> id: 2 cluster_timestamp: 1378990400253 } attemptId: 1 } id: 1 } state:
>>> C_RUNNING diagnostics: "" exit_status: -1000
>>> 2013-09-12 08:54:42,283 INFO
>>> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending
>>> out status for container: container_id { app_attempt_id { application_id {
>>> id: 2 cluster_timestamp: 1378990400253 } attemptId: 1 } id: 1 } state:
>>> C_RUNNING diagnostics: "" exit_status: -1000
>>> 2013-09-12 08:54:43,289 INFO
>>> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending
>>> out status for container: container_id { app_attempt_id { application_id {
>>> id: 2 cluster_timestamp: 1378990400253 } attemptId: 1 } id: 1 } state:
>>> C_RUNNING diagnostics: "" exit_status: -1000
>>>
>>>
>>> Thanks,
>>> Kishore
>>>
>>
>>
>> CONFIDENTIALITY NOTICE
>> NOTICE: This message is intended for the use of the individual or entity
>> to which it is addressed and may contain information that is confidential,
>> privileged and exempt from disclosure under applicable law. If the reader
>> of this message is not the intended recipient, you are hereby notified that
>> any printing, copying, dissemination, distribution, disclosure or
>> forwarding of this communication is strictly prohibited. If you have
>> received this communication in error, please contact the sender immediately
>> and delete it from your system. Thank You.
>
>
>

-- 
CONFIDENTIALITY NOTICE
NOTICE: This message is intended for the use of the individual or entity to 
which it is addressed and may contain information that is confidential, 
privileged and exempt from disclosure under applicable law. If the reader 
of this message is not the intended recipient, you are hereby notified that 
any printing, copying, dissemination, distribution, disclosure or 
forwarding of this communication is strictly prohibited. If you have 
received this communication in error, please contact the sender immediately 
and delete it from your system. Thank You.

Re: Container allocation fails randomly

Posted by Omkar Joshi <oj...@hortonworks.com>.
This is clearly an AM bug. are you using MR-AM or custom AM? you should
check AM code which is computing progress. I suspect there must be some
float computation problems. If it is an MR-AM problem then please file a
map reduce bug.

Thanks,
Omkar Joshi
*Hortonworks Inc.* <http://www.hortonworks.com>


On Tue, Sep 17, 2013 at 2:47 AM, Krishna Kishore Bonagiri <
write2kishore@gmail.com> wrote:

> Hi Omkar,
>
>   Thanks for the quick reply, and sorry for not being able to get the
> required logs that you have asked for.
>
>   But in the meanwhile I just wanted to check if you can get a clue with
> the information I have now. I am seeing the following kind of error message
> in AppMaster.stderr whenever this failure is happening. I don't know why
> does it happen, the getProgress() call that I have implemented
> in RMCallbackHandler could never return a negative value! Doesn't this
> error mean that this getProgress() is giving a -ve value?
>
> Exception in thread "AMRM Heartbeater thread"
> java.lang.IllegalArgumentException: Progress indicator should not be
> negative
>         at
> com.google.common.base.Preconditions.checkArgument(Preconditions.java:88)
>         at
> org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.allocate(AMRMClientImpl.java:199)
>         at
> org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$HeartbeatThread.run(AMRMClientAsyncImpl.java:224)
>
> Thanks,
> Kishore
>
>
> On Fri, Sep 13, 2013 at 2:59 AM, Omkar Joshi <oj...@hortonworks.com>wrote:
>
>> Can you give more information? logs (complete) will help a lot around
>> this time frame. Are the containers getting assigned via scheduler? is it
>> failing when node manager tries to start container? I clearly see the
>> diagnostic message is empty but do you see anything in NM logs? Also if
>> there were running containers on the machine before launching new ones..
>> then are they killed? or they are still hanging around? can you also try
>> applying patch "https://issues.apache.org/jira/browse/YARN-1053" ? and
>> check if you can see any message?
>>
>> Thanks,
>> Omkar Joshi
>> *Hortonworks Inc.* <http://www.hortonworks.com>
>>
>>
>> On Thu, Sep 12, 2013 at 6:15 AM, Krishna Kishore Bonagiri <
>> write2kishore@gmail.com> wrote:
>>
>>> Hi,
>>>   I am using 2.1.0-beta and have seen container allocation failing
>>> randomly even when running the same application in a loop. I know that the
>>> cluster has enough resources to give, because it gave the resources for the
>>> same application all the other times in the loop and ran it successfully.
>>>
>>>    I have observed a lot of the following kind of messages in the node
>>> manager's log whenever such failure happens, any clues as to why it happens?
>>>
>>> 2013-09-12 08:54:36,204 INFO
>>> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending
>>> out status for container: container_id { app_attempt_id { application_id {
>>> id: 2 cluster_timestamp: 1378990400253 } attemptId: 1 } id: 1 } state:
>>> C_RUNNING diagnostics: "" exit_status: -1000
>>> 2013-09-12 08:54:37,220 INFO
>>> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending
>>> out status for container: container_id { app_attempt_id { application_id {
>>> id: 2 cluster_timestamp: 1378990400253 } attemptId: 1 } id: 1 } state:
>>> C_RUNNING diagnostics: "" exit_status: -1000
>>> 2013-09-12 08:54:38,231 INFO
>>> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending
>>> out status for container: container_id { app_attempt_id { application_id {
>>> id: 2 cluster_timestamp: 1378990400253 } attemptId: 1 } id: 1 } state:
>>> C_RUNNING diagnostics: "" exit_status: -1000
>>> 2013-09-12 08:54:39,239 INFO
>>> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending
>>> out status for container: container_id { app_attempt_id { application_id {
>>> id: 2 cluster_timestamp: 1378990400253 } attemptId: 1 } id: 1 } state:
>>> C_RUNNING diagnostics: "" exit_status: -1000
>>> 2013-09-12 08:54:40,267 INFO
>>> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending
>>> out status for container: container_id { app_attempt_id { application_id {
>>> id: 2 cluster_timestamp: 1378990400253 } attemptId: 1 } id: 1 } state:
>>> C_RUNNING diagnostics: "" exit_status: -1000
>>> 2013-09-12 08:54:41,275 INFO
>>> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending
>>> out status for container: container_id { app_attempt_id { application_id {
>>> id: 2 cluster_timestamp: 1378990400253 } attemptId: 1 } id: 1 } state:
>>> C_RUNNING diagnostics: "" exit_status: -1000
>>> 2013-09-12 08:54:42,283 INFO
>>> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending
>>> out status for container: container_id { app_attempt_id { application_id {
>>> id: 2 cluster_timestamp: 1378990400253 } attemptId: 1 } id: 1 } state:
>>> C_RUNNING diagnostics: "" exit_status: -1000
>>> 2013-09-12 08:54:43,289 INFO
>>> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending
>>> out status for container: container_id { app_attempt_id { application_id {
>>> id: 2 cluster_timestamp: 1378990400253 } attemptId: 1 } id: 1 } state:
>>> C_RUNNING diagnostics: "" exit_status: -1000
>>>
>>>
>>> Thanks,
>>> Kishore
>>>
>>
>>
>> CONFIDENTIALITY NOTICE
>> NOTICE: This message is intended for the use of the individual or entity
>> to which it is addressed and may contain information that is confidential,
>> privileged and exempt from disclosure under applicable law. If the reader
>> of this message is not the intended recipient, you are hereby notified that
>> any printing, copying, dissemination, distribution, disclosure or
>> forwarding of this communication is strictly prohibited. If you have
>> received this communication in error, please contact the sender immediately
>> and delete it from your system. Thank You.
>
>
>

-- 
CONFIDENTIALITY NOTICE
NOTICE: This message is intended for the use of the individual or entity to 
which it is addressed and may contain information that is confidential, 
privileged and exempt from disclosure under applicable law. If the reader 
of this message is not the intended recipient, you are hereby notified that 
any printing, copying, dissemination, distribution, disclosure or 
forwarding of this communication is strictly prohibited. If you have 
received this communication in error, please contact the sender immediately 
and delete it from your system. Thank You.

Re: Container allocation fails randomly

Posted by Omkar Joshi <oj...@hortonworks.com>.
This is clearly an AM bug. are you using MR-AM or custom AM? you should
check AM code which is computing progress. I suspect there must be some
float computation problems. If it is an MR-AM problem then please file a
map reduce bug.

Thanks,
Omkar Joshi
*Hortonworks Inc.* <http://www.hortonworks.com>


On Tue, Sep 17, 2013 at 2:47 AM, Krishna Kishore Bonagiri <
write2kishore@gmail.com> wrote:

> Hi Omkar,
>
>   Thanks for the quick reply, and sorry for not being able to get the
> required logs that you have asked for.
>
>   But in the meanwhile I just wanted to check if you can get a clue with
> the information I have now. I am seeing the following kind of error message
> in AppMaster.stderr whenever this failure is happening. I don't know why
> does it happen, the getProgress() call that I have implemented
> in RMCallbackHandler could never return a negative value! Doesn't this
> error mean that this getProgress() is giving a -ve value?
>
> Exception in thread "AMRM Heartbeater thread"
> java.lang.IllegalArgumentException: Progress indicator should not be
> negative
>         at
> com.google.common.base.Preconditions.checkArgument(Preconditions.java:88)
>         at
> org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.allocate(AMRMClientImpl.java:199)
>         at
> org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$HeartbeatThread.run(AMRMClientAsyncImpl.java:224)
>
> Thanks,
> Kishore
>
>
> On Fri, Sep 13, 2013 at 2:59 AM, Omkar Joshi <oj...@hortonworks.com>wrote:
>
>> Can you give more information? logs (complete) will help a lot around
>> this time frame. Are the containers getting assigned via scheduler? is it
>> failing when node manager tries to start container? I clearly see the
>> diagnostic message is empty but do you see anything in NM logs? Also if
>> there were running containers on the machine before launching new ones..
>> then are they killed? or they are still hanging around? can you also try
>> applying patch "https://issues.apache.org/jira/browse/YARN-1053" ? and
>> check if you can see any message?
>>
>> Thanks,
>> Omkar Joshi
>> *Hortonworks Inc.* <http://www.hortonworks.com>
>>
>>
>> On Thu, Sep 12, 2013 at 6:15 AM, Krishna Kishore Bonagiri <
>> write2kishore@gmail.com> wrote:
>>
>>> Hi,
>>>   I am using 2.1.0-beta and have seen container allocation failing
>>> randomly even when running the same application in a loop. I know that the
>>> cluster has enough resources to give, because it gave the resources for the
>>> same application all the other times in the loop and ran it successfully.
>>>
>>>    I have observed a lot of the following kind of messages in the node
>>> manager's log whenever such failure happens, any clues as to why it happens?
>>>
>>> 2013-09-12 08:54:36,204 INFO
>>> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending
>>> out status for container: container_id { app_attempt_id { application_id {
>>> id: 2 cluster_timestamp: 1378990400253 } attemptId: 1 } id: 1 } state:
>>> C_RUNNING diagnostics: "" exit_status: -1000
>>> 2013-09-12 08:54:37,220 INFO
>>> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending
>>> out status for container: container_id { app_attempt_id { application_id {
>>> id: 2 cluster_timestamp: 1378990400253 } attemptId: 1 } id: 1 } state:
>>> C_RUNNING diagnostics: "" exit_status: -1000
>>> 2013-09-12 08:54:38,231 INFO
>>> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending
>>> out status for container: container_id { app_attempt_id { application_id {
>>> id: 2 cluster_timestamp: 1378990400253 } attemptId: 1 } id: 1 } state:
>>> C_RUNNING diagnostics: "" exit_status: -1000
>>> 2013-09-12 08:54:39,239 INFO
>>> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending
>>> out status for container: container_id { app_attempt_id { application_id {
>>> id: 2 cluster_timestamp: 1378990400253 } attemptId: 1 } id: 1 } state:
>>> C_RUNNING diagnostics: "" exit_status: -1000
>>> 2013-09-12 08:54:40,267 INFO
>>> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending
>>> out status for container: container_id { app_attempt_id { application_id {
>>> id: 2 cluster_timestamp: 1378990400253 } attemptId: 1 } id: 1 } state:
>>> C_RUNNING diagnostics: "" exit_status: -1000
>>> 2013-09-12 08:54:41,275 INFO
>>> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending
>>> out status for container: container_id { app_attempt_id { application_id {
>>> id: 2 cluster_timestamp: 1378990400253 } attemptId: 1 } id: 1 } state:
>>> C_RUNNING diagnostics: "" exit_status: -1000
>>> 2013-09-12 08:54:42,283 INFO
>>> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending
>>> out status for container: container_id { app_attempt_id { application_id {
>>> id: 2 cluster_timestamp: 1378990400253 } attemptId: 1 } id: 1 } state:
>>> C_RUNNING diagnostics: "" exit_status: -1000
>>> 2013-09-12 08:54:43,289 INFO
>>> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending
>>> out status for container: container_id { app_attempt_id { application_id {
>>> id: 2 cluster_timestamp: 1378990400253 } attemptId: 1 } id: 1 } state:
>>> C_RUNNING diagnostics: "" exit_status: -1000
>>>
>>>
>>> Thanks,
>>> Kishore
>>>
>>
>>
>
>
>


Re: Container allocation fails randomly

Posted by Krishna Kishore Bonagiri <wr...@gmail.com>.
Hi Omkar,

  Thanks for the quick reply, and sorry that I have not yet been able to
collect the logs you asked for.

  In the meanwhile, I wanted to check whether the information I have now
gives you a clue. I see the following error in AppMaster.stderr whenever
this failure happens, and I don't understand why: the getProgress() method
I have implemented in RMCallbackHandler should never return a negative
value. Doesn't this error mean that getProgress() is returning a negative
value?

Exception in thread "AMRM Heartbeater thread" java.lang.IllegalArgumentException: Progress indicator should not be negative
        at com.google.common.base.Preconditions.checkArgument(Preconditions.java:88)
        at org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.allocate(AMRMClientImpl.java:199)
        at org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$HeartbeatThread.run(AMRMClientAsyncImpl.java:224)
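
One way that check can fire even though nothing in the computation is
literally negative: if the ratio is computed as completed/requested and
the denominator happens to be zero at the moment the heartbeater thread
calls getProgress(), float division produces NaN, and NaN fails a
"progress >= 0" style precondition exactly like a negative value would.
Below is a minimal, self-contained sketch of that arithmetic together with
a defensive clamp; the class, method, and counter names are illustrative
only, not taken from the actual AM code.

// ProgressSketch.java -- standalone illustration, not the real AM code.
public class ProgressSketch {

    // Hypothetical raw computation: completed containers / requested containers.
    static float rawProgress(int completed, int requested) {
        return (float) completed / requested;   // 0/0 in float arithmetic is NaN
    }

    // Defensive version: guard the zero denominator and clamp into [0.0, 1.0].
    static float safeProgress(int completed, int requested) {
        if (requested <= 0) {
            return 0.0f;
        }
        float p = (float) completed / requested;
        if (Float.isNaN(p) || p < 0.0f) {
            return 0.0f;
        }
        return Math.min(p, 1.0f);
    }

    public static void main(String[] args) {
        float raw = rawProgress(0, 0);
        // Prints "raw=NaN, (raw >= 0)=false" -- the same condition that a
        // checkArgument(progress >= 0, ...) precondition rejects with the
        // "Progress indicator should not be negative" message.
        System.out.println("raw=" + raw + ", (raw >= 0)=" + (raw >= 0));
        System.out.println("safe=" + safeProgress(0, 0));
    }
}

If getProgress() only ever returns values computed the safeProgress() way,
the heartbeater thread should never see NaN or anything outside [0, 1].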

Thanks,
Kishore


On Fri, Sep 13, 2013 at 2:59 AM, Omkar Joshi <oj...@hortonworks.com> wrote:

> Can you give more information? logs (complete) will help a lot around this
> time frame. Are the containers getting assigned via scheduler? is it
> failing when node manager tries to start container? I clearly see the
> diagnostic message is empty but do you see anything in NM logs? Also if
> there were running containers on the machine before launching new ones..
> then are they killed? or they are still hanging around? can you also try
> applying patch "https://issues.apache.org/jira/browse/YARN-1053" ? and
> check if you can see any message?
>
> Thanks,
> Omkar Joshi
> *Hortonworks Inc.* <http://www.hortonworks.com>
>
>
> On Thu, Sep 12, 2013 at 6:15 AM, Krishna Kishore Bonagiri <
> write2kishore@gmail.com> wrote:
>
>> Hi,
>>   I am using 2.1.0-beta and have seen container allocation failing
>> randomly even when running the same application in a loop. I know that the
>> cluster has enough resources to give, because it gave the resources for the
>> same application all the other times in the loop and ran it successfully.
>>
>>    I have observed a lot of the following kind of messages in the node
>> manager's log whenever such failure happens, any clues as to why it happens?
>>
>> 2013-09-12 08:54:36,204 INFO
>> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending
>> out status for container: container_id { app_attempt_id { application_id {
>> id: 2 cluster_timestamp: 1378990400253 } attemptId: 1 } id: 1 } state:
>> C_RUNNING diagnostics: "" exit_status: -1000
>> 2013-09-12 08:54:37,220 INFO
>> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending
>> out status for container: container_id { app_attempt_id { application_id {
>> id: 2 cluster_timestamp: 1378990400253 } attemptId: 1 } id: 1 } state:
>> C_RUNNING diagnostics: "" exit_status: -1000
>> 2013-09-12 08:54:38,231 INFO
>> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending
>> out status for container: container_id { app_attempt_id { application_id {
>> id: 2 cluster_timestamp: 1378990400253 } attemptId: 1 } id: 1 } state:
>> C_RUNNING diagnostics: "" exit_status: -1000
>> 2013-09-12 08:54:39,239 INFO
>> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending
>> out status for container: container_id { app_attempt_id { application_id {
>> id: 2 cluster_timestamp: 1378990400253 } attemptId: 1 } id: 1 } state:
>> C_RUNNING diagnostics: "" exit_status: -1000
>> 2013-09-12 08:54:40,267 INFO
>> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending
>> out status for container: container_id { app_attempt_id { application_id {
>> id: 2 cluster_timestamp: 1378990400253 } attemptId: 1 } id: 1 } state:
>> C_RUNNING diagnostics: "" exit_status: -1000
>> 2013-09-12 08:54:41,275 INFO
>> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending
>> out status for container: container_id { app_attempt_id { application_id {
>> id: 2 cluster_timestamp: 1378990400253 } attemptId: 1 } id: 1 } state:
>> C_RUNNING diagnostics: "" exit_status: -1000
>> 2013-09-12 08:54:42,283 INFO
>> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending
>> out status for container: container_id { app_attempt_id { application_id {
>> id: 2 cluster_timestamp: 1378990400253 } attemptId: 1 } id: 1 } state:
>> C_RUNNING diagnostics: "" exit_status: -1000
>> 2013-09-12 08:54:43,289 INFO
>> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending
>> out status for container: container_id { app_attempt_id { application_id {
>> id: 2 cluster_timestamp: 1378990400253 } attemptId: 1 } id: 1 } state:
>> C_RUNNING diagnostics: "" exit_status: -1000
>>
>>
>> Thanks,
>> Kishore
>>
>
>

Re: Container allocation fails randomly

Posted by Omkar Joshi <oj...@hortonworks.com>.
Can you give more information? Complete logs from around this time frame
would help a lot. Are the containers getting assigned by the scheduler? Is
it failing when the NodeManager tries to start the container? I can see
that the diagnostic message is empty, but do you see anything in the NM
logs? Also, if there were containers already running on the machine before
the new ones were launched, were they killed, or are they still hanging
around? Can you also try applying the patch from
"https://issues.apache.org/jira/browse/YARN-1053" and check whether you see
any message?
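
In the meantime, if it helps to surface the same information from the AM
side, a small helper along the following lines could be called from the
AM's AMRMClientAsync.CallbackHandler#onContainersCompleted callback; the
class and method names here are hypothetical, not from any existing code.

import java.util.List;

import org.apache.hadoop.yarn.api.records.ContainerStatus;

// Hypothetical helper: dump whatever status the RM reports for finished containers.
public final class ContainerStatusLogger {

    private ContainerStatusLogger() {
    }

    public static void log(List<ContainerStatus> statuses) {
        for (ContainerStatus status : statuses) {
            System.err.println("container " + status.getContainerId()
                + " state=" + status.getState()
                + " exitStatus=" + status.getExitStatus()
                + " diagnostics=\"" + status.getDiagnostics() + "\"");
        }
    }
}

For reference, the exit_status: -1000 in those NM log lines is
ContainerExitStatus.INVALID, the placeholder reported while a container is
still running and has no real exit code yet.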

Thanks,
Omkar Joshi
*Hortonworks Inc.* <http://www.hortonworks.com>


On Thu, Sep 12, 2013 at 6:15 AM, Krishna Kishore Bonagiri <
write2kishore@gmail.com> wrote:

> Hi,
>   I am using 2.1.0-beta and have seen container allocation failing
> randomly even when running the same application in a loop. I know that the
> cluster has enough resources to give, because it gave the resources for the
> same application all the other times in the loop and ran it successfully.
>
>    I have observed a lot of the following kind of messages in the node
> manager's log whenever such failure happens, any clues as to why it happens?
>
> 2013-09-12 08:54:36,204 INFO
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending
> out status for container: container_id { app_attempt_id { application_id {
> id: 2 cluster_timestamp: 1378990400253 } attemptId: 1 } id: 1 } state:
> C_RUNNING diagnostics: "" exit_status: -1000
> 2013-09-12 08:54:37,220 INFO
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending
> out status for container: container_id { app_attempt_id { application_id {
> id: 2 cluster_timestamp: 1378990400253 } attemptId: 1 } id: 1 } state:
> C_RUNNING diagnostics: "" exit_status: -1000
> 2013-09-12 08:54:38,231 INFO
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending
> out status for container: container_id { app_attempt_id { application_id {
> id: 2 cluster_timestamp: 1378990400253 } attemptId: 1 } id: 1 } state:
> C_RUNNING diagnostics: "" exit_status: -1000
> 2013-09-12 08:54:39,239 INFO
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending
> out status for container: container_id { app_attempt_id { application_id {
> id: 2 cluster_timestamp: 1378990400253 } attemptId: 1 } id: 1 } state:
> C_RUNNING diagnostics: "" exit_status: -1000
> 2013-09-12 08:54:40,267 INFO
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending
> out status for container: container_id { app_attempt_id { application_id {
> id: 2 cluster_timestamp: 1378990400253 } attemptId: 1 } id: 1 } state:
> C_RUNNING diagnostics: "" exit_status: -1000
> 2013-09-12 08:54:41,275 INFO
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending
> out status for container: container_id { app_attempt_id { application_id {
> id: 2 cluster_timestamp: 1378990400253 } attemptId: 1 } id: 1 } state:
> C_RUNNING diagnostics: "" exit_status: -1000
> 2013-09-12 08:54:42,283 INFO
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending
> out status for container: container_id { app_attempt_id { application_id {
> id: 2 cluster_timestamp: 1378990400253 } attemptId: 1 } id: 1 } state:
> C_RUNNING diagnostics: "" exit_status: -1000
> 2013-09-12 08:54:43,289 INFO
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending
> out status for container: container_id { app_attempt_id { application_id {
> id: 2 cluster_timestamp: 1378990400253 } attemptId: 1 } id: 1 } state:
> C_RUNNING diagnostics: "" exit_status: -1000
>
>
> Thanks,
> Kishore
>

