Posted to user@hadoop.apache.org by Forrest Aldrich <fo...@gmail.com> on 2013/09/16 22:35:09 UTC

Resource limits with Hadoop and JVM

We recently experienced a couple of situations that brought one or more 
Hadoop nodes down (unresponsive). One was related to a bug in a 
utility we use (ffmpeg) and was resolved by compiling a new version. 
The second, today, occurred after we attempted to join a new node to 
the cluster.

A basic start of the (local) tasktracker and datanode did not work -- so, 
based on the reference material I found, I issued hadoop mradmin 
-refreshNodes, which was to be followed by hadoop dfsadmin -refreshNodes. 
The load average jumped to 60 and the master (which also runs a slave) 
became unresponsive.
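
For reference, my understanding is that those -refreshNodes commands 
make the NameNode and JobTracker re-read their include/exclude host 
files. A rough sketch of the properties involved, in case it helps 
frame the question -- the file paths below are placeholders, not our 
actual config:

   <!-- hdfs-site.xml: hosts the NameNode will accept / decommission -->
   <property>
     <name>dfs.hosts</name>
     <value>/etc/hadoop/conf/hosts.include</value>
   </property>
   <property>
     <name>dfs.hosts.exclude</name>
     <value>/etc/hadoop/conf/hosts.exclude</value>
   </property>

   <!-- mapred-site.xml: hosts the JobTracker will accept / decommission -->
   <property>
     <name>mapred.hosts</name>
     <value>/etc/hadoop/conf/hosts.include</value>
   </property>
   <property>
     <name>mapred.hosts.exclude</name>
     <value>/etc/hadoop/conf/hosts.exclude</value>
   </property>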

It seems to me that this should never happen. But, looking around, I saw 
an article from Spotify which mentioned the need to set certain resource 
limits on the JVM as well as in the system itself (limits.conf; we run 
RHEL). I (and we) are fairly new to Hadoop, so some of these issues 
are very new to us.
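
To make that concrete, the sorts of limits I understood the article to 
be talking about would look roughly like the following -- the numbers 
are illustrative guesses on my part, not recommendations:

   # /etc/security/limits.conf -- raise per-user OS limits for the hadoop account
   hadoop   soft   nofile   32768
   hadoop   hard   nofile   65536
   hadoop   soft   nproc    32768
   hadoop   hard   nproc    65536

   # conf/hadoop-env.sh -- cap the daemon heap size (in MB)
   export HADOOP_HEAPSIZE=1000

   # mapred-site.xml -- cap each child task JVM's heap
   <property>
     <name>mapred.child.java.opts</name>
     <value>-Xmx512m</value>
   </property>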

I wonder if some of the experts here might be able to comment on this 
issue - perhaps point out settings and other measures we can take to 
prevent this sort of incident in the future.

Our setup is not complicated. We have 3 Hadoop nodes; the first is also 
both a master and a slave (and has more resources, too). The underlying 
system splits tasks out to ffmpeg (which is another issue, as it tends 
to eat resources, but so far, with a recompile, we are good). We have 
two more hardware nodes to add shortly.


Thanks!

Re: Resource limits with Hadoop and JVM

Posted by Forrest Aldrich <fo...@gmail.com>.
Yes, I mentioned below we're running RHEL.

In this case, when I went to add the node, I ran "hadoop mradmin 
-refreshNodes" (as user hadoop) and the master node went completely nuts 
- the system load jumped to 60 ("top" was frozen on the console) and 
required a hard reboot.

Whether or not the slave node I added had errors in the *.xml, this 
should never happen.  At least, I would like it if it never happened 
again ;-)

We're running:

java version "1.6.0_39"
Java(TM) SE Runtime Environment (build 1.6.0_39-b04)
Java HotSpot(TM) 64-Bit Server VM (build 20.14-b01, mixed mode)

Hadoop v1.0.1

Perhaps we ran into a bug? I know we need to upgrade, but we're being 
very cautious about changes to the production environment -- an "if it 
works, don't fix it" type of approach.



Thanks,

Forrest



On 9/16/13 5:04 PM, Vinod Kumar Vavilapalli wrote:
> I assume you are on Linux. Also assuming that your tasks are so 
> resource intensive that they are taking down nodes. You should enable 
> limits per task, see 
> http://hadoop.apache.org/docs/stable/cluster_setup.html#Memory+monitoring
>
> What it does is that jobs are now forced to up front provide their 
> resource requirements, and TTs enforce those limits.
>
> HTH
> +Vinod Kumar Vavilapalli
> Hortonworks Inc.
> http://hortonworks.com/
>
> On Sep 16, 2013, at 1:35 PM, Forrest Aldrich wrote:
>
>> We recently experienced a couple of situations that brought one or 
>> more Hadoop nodes down (unresponsive).   One was related to a bug in 
>> a utility we use (ffmpeg) that was resolved by compiling a new 
>> version. The next, today, occurred after attempting to join a new 
>> node to the cluster.
>>
>> A basic start of the (local) tasktracker and datanode did not work -- 
>> so based on reference, I issued: hadoop mradmin -refreshNodes, which 
>> was to be followed by hadoop dfsadmin -refreshNodes.    The load 
>> average literally jumped to 60 and the master (which also runs a 
>> slave) became unresponsive.
>>
>> Seems to me that this should never happen.   But, looking around, I 
>> saw an article from Spotify which mentioned the need to set certain 
>> resource limits on the JVM as well as in the system itself 
>> (limits.conf, we run RHEL).    I (and we) are fairly new to Hadoop, 
>> so some of these issues are very new.
>>
>> I wonder if some of the experts here might be able to comment on this 
>> issue - perhaps point out settings and other measures we can take to 
>> prevent this sort of incident in the future.
>>
>> Our setup is not complicated.   Have 3 hadoop nodes, the first is 
>> also a master and a slave (has more resources, too).   The underlying 
>> system we do is split up tasks to ffmpeg  (which is another issue as 
>> it tends to eat resources, but so far with a recompile, we are 
>> good).   We have two more hardware nodes to add shortly.
>>
>>
>> Thanks!
>



Re: Resource limits with Hadoop and JVM

Posted by Forrest Aldrich <fo...@gmail.com>.
I wanted to elaborate on what happened.

A Hadoop slave was added to a live cluster. It turns out, I think, that 
mapred-site.xml was not configured with the correct master host (see 
the sketch below). In any case, when these commands were run:


  * $ hadoop mradmin -refreshNodes
  * $ hadoop dfsadmin -refreshNodes

The master went completely berserk, climbing to a system load of 60, at 
which point it froze.
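
For the record, the "master host" I'm referring to is the JobTracker 
address in mapred-site.xml. A minimal sketch of what the new slave 
should have been pointing at -- the hostname and port here are 
placeholders, not our real values:

   <property>
     <name>mapred.job.tracker</name>
     <value>master.example.com:9001</value>
   </property>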

This should never, ever happen -- no matter what the issue. So what 
I'm trying to understand is how to prevent this while still allowing 
Hadoop/Java to go about its business.

We are using an older version of Hadoop (1.0.1), so maybe we hit a bug; 
I can't really tell.

I read an article about Spotify experiencing issues like this and some 
of the approaches they took, but it's not clear which of those apply 
here (I'm a newbie).


Thanks.



On 9/16/13 5:04 PM, Vinod Kumar Vavilapalli wrote:
> I assume you are on Linux. Also assuming that your tasks are so 
> resource intensive that they are taking down nodes. You should enable 
> limits per task, see 
> http://hadoop.apache.org/docs/stable/cluster_setup.html#Memory+monitoring
>
> What it does is that jobs are now forced to up front provide their 
> resource requirements, and TTs enforce those limits.
>
> HTH
> +Vinod Kumar Vavilapalli
> Hortonworks Inc.
> http://hortonworks.com/
>
> On Sep 16, 2013, at 1:35 PM, Forrest Aldrich wrote:
>
>> We recently experienced a couple of situations that brought one or 
>> more Hadoop nodes down (unresponsive).   One was related to a bug in 
>> a utility we use (ffmpeg) that was resolved by compiling a new 
>> version. The next, today, occurred after attempting to join a new 
>> node to the cluster.
>>
>> A basic start of the (local) tasktracker and datanode did not work -- 
>> so based on reference, I issued: hadoop mradmin -refreshNodes, which 
>> was to be followed by hadoop dfsadmin -refreshNodes.    The load 
>> average literally jumped to 60 and the master (which also runs a 
>> slave) became unresponsive.
>>
>> Seems to me that this should never happen.   But, looking around, I 
>> saw an article from Spotify which mentioned the need to set certain 
>> resource limits on the JVM as well as in the system itself 
>> (limits.conf, we run RHEL).    I (and we) are fairly new to Hadoop, 
>> so some of these issues are very new.
>>
>> I wonder if some of the experts here might be able to comment on this 
>> issue - perhaps point out settings and other measures we can take to 
>> prevent this sort of incident in the future.
>>
>> Our setup is not complicated.   Have 3 hadoop nodes, the first is 
>> also a master and a slave (has more resources, too).   The underlying 
>> system we do is split up tasks to ffmpeg  (which is another issue as 
>> it tends to eat resources, but so far with a recompile, we are 
>> good).   We have two more hardware nodes to add shortly.
>>
>>
>> Thanks!
>


Re: Resource limits with Hadoop and JVM

Posted by Vinod Kumar Vavilapalli <vi...@apache.org>.
I assume you are on Linux. I'm also assuming that your tasks are so resource-intensive that they are taking down nodes. You should enable per-task limits; see http://hadoop.apache.org/docs/stable/cluster_setup.html#Memory+monitoring

What this does is force jobs to declare their resource requirements up front, and the TaskTrackers (TTs) then enforce those limits.
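
As a rough illustration only (the numbers are placeholders -- tune them 
for your hardware), and if I remember right, the relevant mapred-site.xml 
knobs are along these lines:

   <!-- per-slot memory advertised by each TaskTracker -->
   <property>
     <name>mapred.cluster.map.memory.mb</name>
     <value>2048</value>
   </property>
   <property>
     <name>mapred.cluster.reduce.memory.mb</name>
     <value>2048</value>
   </property>

   <!-- cluster-wide ceilings on what a single job may request -->
   <property>
     <name>mapred.cluster.max.map.memory.mb</name>
     <value>4096</value>
   </property>
   <property>
     <name>mapred.cluster.max.reduce.memory.mb</name>
     <value>4096</value>
   </property>

   <!-- what a job asks for; the TT's memory monitor enforces it -->
   <property>
     <name>mapred.job.map.memory.mb</name>
     <value>2048</value>
   </property>
   <property>
     <name>mapred.job.reduce.memory.mb</name>
     <value>2048</value>
   </property>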

HTH
+Vinod Kumar Vavilapalli
Hortonworks Inc.
http://hortonworks.com/

On Sep 16, 2013, at 1:35 PM, Forrest Aldrich wrote:

> We recently experienced a couple of situations that brought one or more Hadoop nodes down (unresponsive).   One was related to a bug in a utility we use (ffmpeg) that was resolved by compiling a new version. The next, today, occurred after attempting to join a new node to the cluster.   
> 
> A basic start of the (local) tasktracker and datanode did not work -- so based on reference, I issued:  hadoop mradmin -refreshNodes, which was to be followed by hadoop dfsadmin -refreshNodes.    The load average literally jumped to 60 and the master (which also runs a slave) became unresponsive.
> 
> Seems to me that this should never happen.   But, looking around, I saw an article from Spotify which mentioned the need to set certain resource limits on the JVM as well as in the system itself (limits.conf, we run RHEL).    I (and we) are fairly new to Hadoop, so some of these issues are very new.
> 
> I wonder if some of the experts here might be able to comment on this issue - perhaps point out settings and other measures we can take to prevent this sort of incident in the future.
> 
> Our setup is not complicated.   Have 3 hadoop nodes, the first is also a master and a slave (has more resources, too).   The underlying system we do is split up tasks to ffmpeg  (which is another issue as it tends to eat resources, but so far with a recompile, we are good).   We have two more hardware nodes to add shortly.
> 
> 
> Thanks!


-- 
CONFIDENTIALITY NOTICE
NOTICE: This message is intended for the use of the individual or entity to 
which it is addressed and may contain information that is confidential, 
privileged and exempt from disclosure under applicable law. If the reader 
of this message is not the intended recipient, you are hereby notified that 
any printing, copying, dissemination, distribution, disclosure or 
forwarding of this communication is strictly prohibited. If you have 
received this communication in error, please contact the sender immediately 
and delete it from your system. Thank You.
