Posted to common-user@hadoop.apache.org by Manish Malhotra <ma...@gmail.com> on 2013/11/03 09:21:21 UTC

Namenode / Cluster scaling issues in AWS environment

Hi All,

I'm facing issues scaling a Hadoop cluster. The cluster configuration is as
follows:


1. AWS infrastructure
2. 400 DNs
3. NN: 120 GB memory, 10 Gb network, 32 cores
            dfs.namenode.handler.count = 128
            IPC queue size = 128 (default)
4. DN: 15.5 GB memory, 1 Gb network, 8 cores
5. Hadoop version: 1.0.2
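One way to sanity-check the NN memory headroom against the namespace size: a
common rule of thumb for 1.x-era NameNodes is on the order of 150 bytes of
heap per namespace object (file, directory, or block). A rough sketch, with
the per-object figure being an approximation rather than a measured value:

```python
# Rough NN heap estimate. The ~150-bytes-per-object figure is a common
# rule of thumb for 1.x-era NameNodes, not an exact measurement.
BYTES_PER_OBJECT = 150  # per file, directory, or block object (approximate)

def nn_heap_estimate_gb(files, blocks_per_file=1):
    """Approximate NN heap needed for the namespace, in GB.

    Replicas do not add namespace objects; only files and their blocks
    count here (directories are ignored for simplicity).
    """
    objects = files * (1 + blocks_per_file)
    return objects * BYTES_PER_OBJECT / 1e9

# 100M small files (1 block each) -> roughly 30 GB of heap just for the
# namespace, which is why consolidating small files matters even with
# 120 GB of memory on the NN.
print(nn_heap_estimate_gb(100_000_000))
```

The point of the sketch is that namespace object count, not raw data volume,
drives NN heap pressure, so many small files hurt far more than few large ones.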


Problem: Sometimes the NN becomes unstable and starts showing DNs as down,
even though the DNs are actually running.
I have seen "Socket timeout" exceptions from the DNs and also "xcievers"
exceptions.
It looks like the NN is busy during those periods and suddenly starts losing
the DN heartbeats.
Once it sees DNs as down, it starts replicating their blocks to other nodes,
but then more nodes become unavailable and it tries to replicate those
blocks as well.
The NN is trapped in this cycle and cannot come out of it.
The NN looks fine from a memory and CPU usage point of view: it uses at most
150% CPU. I believe version 1.0.2 does not make use of multiple cores and
effectively runs on a single core.

Potential reasons:

1. Small files: we have lots and lots of small files, and we are working on
consolidating them.
2. The AWS infrastructure is not reliable, so we should increase the
"datanode.recheck.interval" property to allow more time before a DN is
declared dead.
3. Lots of connections to the NN from clients and MR jobs.
4. The DNs have memory/thread issues, so they are not even connecting to the
NN. We have not seen an OOM issue yet, though.
5. An NN thread dump taken at the time of the issue shows all the Handler
threads waiting on a lock.
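On point 2, note that the NN does not declare a DN dead at the re-check
interval itself: in the 1.x code the expiry window is derived from both the
re-check interval and the heartbeat interval. A quick sketch of that
calculation, assuming the usual defaults (5-minute re-check, 3-second
heartbeat):

```python
# Sketch of how the NN computes the DN "dead" timeout in Hadoop 1.x:
#   heartbeatExpireInterval = 2 * recheck_interval + 10 * heartbeat_interval

def dead_node_timeout_ms(recheck_interval_ms=300_000, heartbeat_interval_s=3):
    """Time after the last heartbeat before the NN marks a DN dead (ms)."""
    return 2 * recheck_interval_ms + 10 * heartbeat_interval_s * 1000

# With the defaults this comes to 10.5 minutes:
print(dead_node_timeout_ms() / 60_000)

# Doubling the re-check interval (as suggested above) roughly doubles
# the window, to 20.5 minutes:
print(dead_node_timeout_ms(recheck_interval_ms=600_000) / 60_000)
```

So raising the re-check interval does buy time before re-replication kicks
in, at the cost of reacting more slowly to DNs that really have died.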

If anybody has similar experience with Hadoop on AWS (or any other
infrastructure) and can give some input, that would be great.

Regards,
Manish

Re: Namenode / Cluster scaling issues in AWS environment

Posted by Chris Mawata <ch...@gmail.com>.
You might also consider federation.
Chris


On 11/3/2013 3:21 AM, Manish Malhotra wrote:
> Hi All,
>
> I'm facing issues in scaling a Hadoop cluster, I have following 
> cluster config.
>
>
> 1. AWS Infrastructure.
> 2. 400 DN
> 3. NN :
>             120 gb memory, 10gb network,32 cores
>             dfs.namenode.handler.count = 128
>              ipc queue size = 128 ( default)
> 4. DN: 15.5 gb memory. 1 gb network, 8cores
> 5. Hadoop version: 1.0.2
>
>
> Problem: Sometime NN becomes unstable, and started showing DN's as down.
> But actually DNs are running.
> I have seen "Socket timeout exception" from DN and also " xrecievers 
> Exception".
> Looks like the NN is busy for that time, and suddenly it start loosing 
> the hearbeat of DNs.
> Once it sees DNs are down, it start replicating blocks to other nodes, 
> but then again more nodes become unavailable and again it tries to 
> replicate those blocks.
> This is like a cycle where NN trapped, and not able to come out.
> NN looks good from Memory and CPU usage point of view.
> Maximum it uses 150% CPU, I believe 1.0.2 version is not using multi 
> cores, and uses single core only
>
> Potential Reasons:
>
> 1. Small files, we have lots and lots of small files, we are working 
> on it.
> 2. AWS Infra is not reliable, so should increase the 
> "datanode.recheck.interval" property to give more time before 
> declaring DN as dead.
> 3. Lots of connections to NN from clients and MR jobs.
> 4. DNs have issues in terms of Memory / Threads, so that its actually 
> not even connecting to the NN.
> But have not seen the OOM issue, yet.
>
> 5. NN threaddump at the time of issue, showing all the Handler threads 
> are in waiting for lock state.
>
> If anybody has similar experience with Hadoop on AWS or any infra and 
> can give some input that will be great.
>
> Regards,
> Manish
>

