Posted to user@hadoop.apache.org by Casey K <ro...@gmail.com> on 2014/03/31 21:32:32 UTC

YARN App Master logs and other qns

Hello,

I am fairly new to the Hadoop framework, so I appreciate your patience in
case my email is not entirely correct or the terminology is wrong. I have
a working installation. However, I am facing a few issues:

1) I have run the PI example a number of times. The number of slave nodes
used is 4. Most times the runtime is about 31 secs. Other times, it varies
widely and goes up to 650 secs. What could be causing this? This is a
dedicated cluster with no other workloads.

2) "nodemanager did not stop gracefully after 5 seconds: killing with kill
-9" Every time during shutdown, the nodemanager is forcibly killed because
it doesn't respond in 5 seconds. I dug through the logs and didn't find
anything off. One thing I found is noted in (3).

3) I see errors as follows: "2014-03-31 12:27:26,975 ERROR [RMCommunicator
Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator:
Container complete event for unknown container id
container_1396286812424_0001_01_000042" My searches indicate this is
because the connection to the appmaster is lost. I can't seem to find where
the appmaster logs are.

4) Is the proxy server needed? I did not set "yarn.web-proxy.address" and
so it never starts. My understanding is that it starts as a part of RM in
this case.

5) RDMA based shuffle - Mellanox seems to have contributed code for RDMA
shuffle instead of HTTP. Is this part of YARN? If yes, how do I enable it?
Is UDA required for RDMA shuffle?

6) If I want to provide support for a new file system, is there a tutorial
on everything that needs to be implemented? I found that
org.apache.hadoop.fs.FileSystem is the class to extend. However, sample
code or documentation would help.

Appreciate the help.

Regards,
Casey

Re: YARN App Master logs and other qns

Posted by Casey K <ro...@gmail.com>.
I was able to address item (2) below.

Looking through the logs, I noticed that the node manager initiated
shutdown but was killed before it could finish. So I increased the value
of YARN_STOP_TIMEOUT from the default 5 secs to 10 secs and in some cases
30 secs. Is it normal to need timeouts longer than 10 secs?
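
A minimal sketch of that kind of override, assuming the variable is
exported from the environment file that the YARN daemon scripts source
(etc/hadoop/yarn-env.sh is an assumption here; the message above does not
say where the value was set). The value 30 is the larger of the two values
mentioned above:

```shell
# Assumed location: etc/hadoop/yarn-env.sh (not stated in the message).
# Seconds the stop script waits after the initial kill before it falls
# back to "kill -9"; the default mentioned above is 5.
export YARN_STOP_TIMEOUT=30
```

The new value only takes effect for stop commands issued from an
environment where the variable is set.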

On Mon, Mar 31, 2014 at 2:32 PM, Casey K <ro...@gmail.com> wrote:

> Hello,
>
> I am fairly new to the Hadoop framework, so I appreciate your patience in
> case my email is not entirely correct or the terminology is wrong. I have
> a working installation. However, I am facing a few issues:
>
> 1) I have run the PI example a number of times. The number of slave nodes
> used is 4. Most times the runtime is about 31 secs. Other times, it varies
> widely and goes up to 650 secs. What could be causing this? This is a
> dedicated cluster with no other workloads.
>
> 2) "nodemanager did not stop gracefully after 5 seconds: killing with kill
> -9" Every time during shutdown, the nodemanager is forcibly killed because
> it doesn't respond in 5 seconds. I dug through the logs and didn't find
> anything off. One thing I found is noted in (3).
>
> 3) I see errors as follows: "2014-03-31 12:27:26,975 ERROR [RMCommunicator
> Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator:
> Container complete event for unknown container id
> container_1396286812424_0001_01_000042" My searches indicate this is
> because the connection to the appmaster is lost. I can't seem to find where
> the appmaster logs are.
>
> 4) Is the proxy server needed? I did not set "yarn.web-proxy.address" and
> so it never starts. My understanding is that it starts as a part of RM in
> this case.
>
> 5) RDMA based shuffle - Mellanox seems to have contributed code for RDMA
> shuffle instead of HTTP. Is this part of YARN? If yes, how do I enable it?
> Is UDA required for RDMA shuffle?
>
> 6) If I want to provide support for a new file system, is there a tutorial
> on everything that needs to be implemented? I found that
> org.apache.hadoop.fs.FileSystem is the class to extend. However, sample
> code or documentation would help.
>
> Appreciate the help.
>
> Regards,
> Casey
>
