Posted to mapreduce-user@hadoop.apache.org by Han-Cheol Cho <ha...@nhn-playart.com> on 2014/10/22 06:47:26 UTC

A problem with Hadoop PID files

Hi all,
 
I am using Monit to monitor Hadoop processes and restart them automatically when they fail.
 
From time to time, however, a Hadoop process (e.g., the NameNode) is running with one PID, say 1111, while its pid file (/var/run/hadoop-hdfs/hadoop-hdfs-namenode.pid) contains a different value, say 1222.
Monit then assumes that the service is not running and tries to start it with the configured command "/sbin/service hadoop-hdfs-namenode start".
The problem is that the NameNode is already running (with a PID different from the one in the pid file).
The service command therefore fails, but it still rewrites the pid file, so the number in the file keeps growing with every attempt.
 
My guess is that Monit, after finding that the NameNode is not running, launches it several times in quick succession; the first instance comes up, but the second one overwrites the pid file.
The launch script also does not seem to have any locking to protect the pid file.
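
One way to add the missing lock, sketched here as an assumption rather than anything the Hadoop init scripts provide, is to wrap the start command so that only one start attempt can run at a time. The function name start_once and the lock file path are hypothetical; flock(1) comes from util-linux.

```shell
#!/bin/bash
# Hypothetical wrapper, not part of the Hadoop packages.
# LOCK_FILE is an assumed path; any file writable by the caller works.
LOCK_FILE="${LOCK_FILE:-/var/lock/hadoop-hdfs-namenode.start.lock}"

# Run the given command only if no other start attempt holds the lock.
start_once() {
    (
        # Take an exclusive lock on fd 9 without waiting; give up if it is busy.
        flock -n 9 || { echo "another start attempt is running; skipping" >&2; exit 1; }
        # Close fd 9 for the command so a daemon it spawns does not keep the lock.
        "$@" 9>&-
    ) 9>"$LOCK_FILE"
}
```

Saved in a script that ends with `start_once /sbin/service hadoop-hdfs-namenode start`, this could stand in for the start command given to Monit. A second attempt that arrives while the first is still running exits with status 1 without running the service command, so it cannot touch the pid file.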
 
Has anyone experienced a similar problem?
For now, my workaround is to stop the process by hand (kill -15 pid), since "service ... stop" does not work either.
 
Best wishes,



 
趙漢哲 (CHO, Han-Cheol, Ph.D.)
Data Science Lab. / Data scientist
TEL: 03-5155-1160 (department main line)   FAX: 03-5155-3307
27F Shibuya Hikarie, 2-21-1 Shibuya, Shibuya-ku, Tokyo 150-8510
Email: hancheol.cho@nhn-playart.com



 

Re: A problem with Hadoop PID files

Posted by Han-Cheol Cho <ha...@nhn-playart.com>.
Hi Vikas,
 
Thank you for your reply.
 
 
I understand that the problem can be fixed the way you suggested, and I actually did so a few times while looking into its cause.
But I don't want to fix it manually, since it can happen even while I'm asleep at 4:00 AM :-)
 
As a working solution, I am currently using the following start and stop commands in Monit.
------------------------------ 
check process hadoop-hdfs-namenode with pidfile /var/run/hadoop-hdfs/hadoop-hdfs-namenode.pid
  start program = "/bin/bash -c 'pkill -15 -f org\.apache\.hadoop\.hdfs\.server\.namenode\.NameNode; service hadoop-hdfs-namenode start'"
  stop  program = "/bin/bash -c 'pkill -15 -f org\.apache\.hadoop\.hdfs\.server\.namenode\.NameNode'"
------------------------------
This still uses the same pidfile option to check the daemon's status, but it stops the daemon with the "pkill" command instead of "service ... stop".
Monit can therefore stop the daemon that is actually running (rather than the one named in the pid file), even when the pid file contains the wrong PID.
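
Before killing by name, it may be worth confirming that the pid file is really stale. The following is a minimal sketch under that assumption; the function name pidfile_alive is made up for illustration, and it relies only on `kill -0`, which checks whether a signal could be delivered without actually sending one.

```shell
#!/bin/bash
# Hypothetical helper: succeed only if the pid file names a live process.
pidfile_alive() {
    local pid_file="$1" pid
    # A missing or unreadable pid file counts as "not alive".
    pid="$(cat "$pid_file" 2>/dev/null)" || return 1
    # Reject empty, zero, or non-numeric contents before passing them to kill.
    case "$pid" in
        ''|0|*[!0-9]*) return 1 ;;
    esac
    # kill -0 sends no signal; it only reports whether the process can be signalled.
    # Without root it also fails for processes owned by another user.
    kill -0 "$pid" 2>/dev/null
}
```

A live PID does not prove that the process is the NameNode, since PIDs can be reused, so a check like this complements the pkill approach rather than replacing it.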

I am not sure how many people on this mailing list are interested in this subject, but I hope this is helpful to someone.
Best wishes,
 
 
-----Original Message-----
From: "vikas srivastava" <vikas_srivastava@apple.com>
To: "Han-Cheol Cho" <hancheol.cho@nhn-playart.com>;
Cc: 
Sent: 2014-10-27 (Mon) 19:00:11
Subject: Re: A problem with Hadoop PID files


Hey, just delete the file, or put the correct PID into hdfs-namenode.pid. Thanks.