You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@hadoop.apache.org by David chen <c7...@163.com> on 2015/02/09 04:30:08 UTC

Stopping ntpd signals SIGTERM, then causes namenode exit

A shell script is deployed on every node of HDFS cluster, the script is invoked hourly by crontab, and its content is as follows:
#!/bin/bash
service ntpd stop
ntpdate 192.168.0.1 #it's a valid ntpd server in LAN
service ntpd start
chkconfig ntpd on


After several days, NameNode crashed suddenly, but its log seemed no other errors except the following:
2015-01-07 14:00:00,709 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: RECEIVED SIGNAL 15: SIGTERM


Inspected the Linux log(Centos /var/log/messages), also found the following clues:
Jan  7 14:00:01 host1 ntpd[32101]: ntpd exiting on signal 15
Jan  7 13:59:59 host1 ntpd[44764]: ntpd 4.2.4p8@1.1612-o Fri Feb 22 11:23:27 UTC 2013 (1)
Jan  7 13:59:59 host1 ntpd[44765]: precision = 0.143 usec
Jan  7 13:59:59 host1 ntpd[44765]: Listening on interface #0 wildcard, 0.0.0.0#123 Disabled
Jan  7 13:59:59 host1 ntpd[44765]: Listening on interface #1 wildcard, ::#123 Disabled
Jan  7 13:59:59 host1 ntpd[44765]: Listening on interface #2 lo, ::1#123 Enabled
Jan  7 13:59:59 host1 ntpd[44765]: Listening on interface #3 em2, fe80::ca1f:66ff:fee1:eed#123 Enabled
Jan  7 13:59:59 host1 ntpd[44765]: Listening on interface #4 lo, 127.0.0.1#123 Enabled
Jan  7 13:59:59 host1 ntpd[44765]: Listening on interface #5 em2, 192.168.1.151#123 Enabled
Jan  7 13:59:59 host1 ntpd[44765]: Listening on routing socket on fd #22 for interface updates
Jan  7 13:59:59 host1 ntpd[44765]: kernel time sync status 2040
Jan  7 13:59:59 host1 ntpd[44765]: frequency initialized 499.399 PPM from /var/lib/ntp/drift
Jan  7 14:00:01 host1 ntpd_initres[32103]: parent died before we finished, exiting
Jan  7 14:04:17 host1 ntpd[44765]: synchronized to 192.168.0.191, stratum 2
Jan  7 14:04:17 host1 ntpd[44765]: kernel time sync status change 2001
Jan  7 14:26:02 host1 snmpd[4842]: Received TERM or STOP signal...  shutting down...
Jan  7 14:26:02 host1 kernel: netlink: 12 bytes leftover after parsing attributes.
Jan  7 14:26:02 host1 snmpd[45667]: NET-SNMP version 5.5
Jan  7 14:52:48 host1 ntpd[44765]: no servers reachable


It looks likely that NameNode received the SIGTERM signal sent by stopping ntpd command.
Up to now, the problem has happened three times repeatedly, the time point was Jan  7 14:00:00, Jan 14 14:00:00 and Feb  4 14:00:00 respectively.
Although the script to synchronize time is a little improper, and i also know the correct synchronized way. but i wonder why NameNode can receive the SIGTERM signal sent by stopping ntpd command? and why three times all happened at 14:00:00?
Any ideas can be appreciated.

Re:Re: Re: Stopping ntpd signals SIGTERM, then causes namenode exit

Posted by David chen <c7...@163.com>.

Hi Or Sher，
Thanks for sharing your experiences, it indeed helped me.

Re:Re: Re: Stopping ntpd signals SIGTERM, then causes namenode exit

Posted by David chen <c7...@163.com>.

Hi Or Sher，
Thanks for sharing your experiences, it indeed helped me.

Re:Re: Re: Stopping ntpd signals SIGTERM, then causes namenode exit

Posted by David chen <c7...@163.com>.

Hi Or Sher，
Thanks for sharing your experiences, it indeed helped me.

Re:Re: Re: Stopping ntpd signals SIGTERM, then causes namenode exit

Posted by David chen <c7...@163.com>.

Hi Or Sher，
Thanks for sharing your experiences, it indeed helped me.

Re: Re: Stopping ntpd signals SIGTERM, then causes namenode exit

Posted by Or Sher <or...@gmail.com>.

I'm not sure it's related but I encountered a similar issue a few months
ago.
In my case, it was an "at" command sending a kill signal to the at daemon
with it's correct pid.
Somehow, once in a while this signal got to Cassandra process (Java as
well) and killed it.
After some time of investigation I assumed this have to be a kernel bug or
something and I've opened a ticket for CentOS -
http://bugs.centos.org/view.php?id=7539 which no body is really looking at
:)
You can read there how I tried to tackle it.
Bottom line, we've changed the at scheduler to a different implementation
and we don't get this issue any more.

HTH,
Or.


On Wed, Feb 11, 2015 at 3:39 AM, David chen <c7...@163.com> wrote:

> The command 'service ntpd stop' could be triggered around 14:00.
> Because the crontab was set as follows:
> 0 * * * * sh sync.sh
> The script contains the following command:
> #!/bin/bash
> service ntpd stop
> ntpdate 192.168.0.1 #it's a valid ntpd server in LAN
> service ntpd start
> chkconfig ntpd on
>
> Found the following fragment in /var/log/message:
> Jan  7 14:00:01 host1 ntpd[32101]: ntpd exiting on signal 15
> Jan  7 13:59:59 host1 ntpd[44764]: ntpd 4.2.4p8@1.1612-o Fri Feb 22
> 11:23:27 UTC 2013 (1)
> Jan  7 13:59:59 host1 ntpd[44765]: precision = 0.143 usec
> Jan  7 13:59:59 host1 ntpd[44765]: Listening on interface #0 wildcard,
> 0.0.0.0#123 Disabled
> Jan  7 13:59:59 host1 ntpd[44765]: Listening on interface #1 wildcard,
> ::#123 Disabled
> Jan  7 13:59:59 host1 ntpd[44765]: Listening on interface #2 lo, ::1#123
> Enabled
> Jan  7 13:59:59 host1 ntpd[44765]: Listening on interface #3 em2,
> fe80::ca1f:66ff:fee1:eed#123 Enabled
> Jan  7 13:59:59 host1 ntpd[44765]: Listening on interface #4 lo,
> 127.0.0.1#123 Enabled
> Jan  7 13:59:59 host1 ntpd[44765]: Listening on interface #5 em2,
> 192.168.1.151#123 Enabled
> Jan  7 13:59:59 host1 ntpd[44765]: Listening on routing socket on fd #22
> for interface updates
> Jan  7 13:59:59 host1 ntpd[44765]: kernel time sync status 2040
> Jan  7 13:59:59 host1 ntpd[44765]: frequency initialized 499.399 PPM from
> /var/lib/ntp/drift
> Jan  7 14:00:01 host1 ntpd_initres[32103]: parent died before we finished,
> exiting
> Jan  7 14:04:17 host1 ntpd[44765]: synchronized to 192.168.0.191, stratum 2
> Jan  7 14:04:17 host1 ntpd[44765]: kernel time sync status change 2001
> Jan  7 14:26:02 host1 snmpd[4842]: Received TERM or STOP signal...
>  shutting down...
> Jan  7 14:26:02 host1 kernel: netlink: 12 bytes leftover after parsing
> attributes.
> Jan  7 14:26:02 host1 snmpd[45667]: NET-SNMP version 5.5
> Jan  7 14:52:48 host1 ntpd[44765]: no servers reachable
>
> It looks likely the command 'service ntpd stop' send the SIGTERM signal.
> The above clue 'ntpd[32101]' indicates that the ntpd process PID is 32101,
> inspect NameNode log, i found that the NameNode process PID was not
> identical with ntpd.
> So i wonder why Namenode process can received the signal?
>



-- 
Or Sher

Re: Re: Stopping ntpd signals SIGTERM, then causes namenode exit

Posted by Or Sher <or...@gmail.com>.

I'm not sure it's related but I encountered a similar issue a few months
ago.
In my case, it was an "at" command sending a kill signal to the at daemon
with it's correct pid.
Somehow, once in a while this signal got to Cassandra process (Java as
well) and killed it.
After some time of investigation I assumed this have to be a kernel bug or
something and I've opened a ticket for CentOS -
http://bugs.centos.org/view.php?id=7539 which no body is really looking at
:)
You can read there how I tried to tackle it.
Bottom line, we've changed the at scheduler to a different implementation
and we don't get this issue any more.

HTH,
Or.


On Wed, Feb 11, 2015 at 3:39 AM, David chen <c7...@163.com> wrote:

> The command 'service ntpd stop' could be triggered around 14:00.
> Because the crontab was set as follows:
> 0 * * * * sh sync.sh
> The script contains the following command:
> #!/bin/bash
> service ntpd stop
> ntpdate 192.168.0.1 #it's a valid ntpd server in LAN
> service ntpd start
> chkconfig ntpd on
>
> Found the following fragment in /var/log/message:
> Jan  7 14:00:01 host1 ntpd[32101]: ntpd exiting on signal 15
> Jan  7 13:59:59 host1 ntpd[44764]: ntpd 4.2.4p8@1.1612-o Fri Feb 22
> 11:23:27 UTC 2013 (1)
> Jan  7 13:59:59 host1 ntpd[44765]: precision = 0.143 usec
> Jan  7 13:59:59 host1 ntpd[44765]: Listening on interface #0 wildcard,
> 0.0.0.0#123 Disabled
> Jan  7 13:59:59 host1 ntpd[44765]: Listening on interface #1 wildcard,
> ::#123 Disabled
> Jan  7 13:59:59 host1 ntpd[44765]: Listening on interface #2 lo, ::1#123
> Enabled
> Jan  7 13:59:59 host1 ntpd[44765]: Listening on interface #3 em2,
> fe80::ca1f:66ff:fee1:eed#123 Enabled
> Jan  7 13:59:59 host1 ntpd[44765]: Listening on interface #4 lo,
> 127.0.0.1#123 Enabled
> Jan  7 13:59:59 host1 ntpd[44765]: Listening on interface #5 em2,
> 192.168.1.151#123 Enabled
> Jan  7 13:59:59 host1 ntpd[44765]: Listening on routing socket on fd #22
> for interface updates
> Jan  7 13:59:59 host1 ntpd[44765]: kernel time sync status 2040
> Jan  7 13:59:59 host1 ntpd[44765]: frequency initialized 499.399 PPM from
> /var/lib/ntp/drift
> Jan  7 14:00:01 host1 ntpd_initres[32103]: parent died before we finished,
> exiting
> Jan  7 14:04:17 host1 ntpd[44765]: synchronized to 192.168.0.191, stratum 2
> Jan  7 14:04:17 host1 ntpd[44765]: kernel time sync status change 2001
> Jan  7 14:26:02 host1 snmpd[4842]: Received TERM or STOP signal...
>  shutting down...
> Jan  7 14:26:02 host1 kernel: netlink: 12 bytes leftover after parsing
> attributes.
> Jan  7 14:26:02 host1 snmpd[45667]: NET-SNMP version 5.5
> Jan  7 14:52:48 host1 ntpd[44765]: no servers reachable
>
> It looks likely the command 'service ntpd stop' send the SIGTERM signal.
> The above clue 'ntpd[32101]' indicates that the ntpd process PID is 32101,
> inspect NameNode log, i found that the NameNode process PID was not
> identical with ntpd.
> So i wonder why Namenode process can received the signal?
>



-- 
Or Sher

Re: Re: Stopping ntpd signals SIGTERM, then causes namenode exit

Posted by Or Sher <or...@gmail.com>.

I'm not sure it's related but I encountered a similar issue a few months
ago.
In my case, it was an "at" command sending a kill signal to the at daemon
with it's correct pid.
Somehow, once in a while this signal got to Cassandra process (Java as
well) and killed it.
After some time of investigation I assumed this have to be a kernel bug or
something and I've opened a ticket for CentOS -
http://bugs.centos.org/view.php?id=7539 which no body is really looking at
:)
You can read there how I tried to tackle it.
Bottom line, we've changed the at scheduler to a different implementation
and we don't get this issue any more.

HTH,
Or.


On Wed, Feb 11, 2015 at 3:39 AM, David chen <c7...@163.com> wrote:

> The command 'service ntpd stop' could be triggered around 14:00.
> Because the crontab was set as follows:
> 0 * * * * sh sync.sh
> The script contains the following command:
> #!/bin/bash
> service ntpd stop
> ntpdate 192.168.0.1 #it's a valid ntpd server in LAN
> service ntpd start
> chkconfig ntpd on
>
> Found the following fragment in /var/log/message:
> Jan  7 14:00:01 host1 ntpd[32101]: ntpd exiting on signal 15
> Jan  7 13:59:59 host1 ntpd[44764]: ntpd 4.2.4p8@1.1612-o Fri Feb 22
> 11:23:27 UTC 2013 (1)
> Jan  7 13:59:59 host1 ntpd[44765]: precision = 0.143 usec
> Jan  7 13:59:59 host1 ntpd[44765]: Listening on interface #0 wildcard,
> 0.0.0.0#123 Disabled
> Jan  7 13:59:59 host1 ntpd[44765]: Listening on interface #1 wildcard,
> ::#123 Disabled
> Jan  7 13:59:59 host1 ntpd[44765]: Listening on interface #2 lo, ::1#123
> Enabled
> Jan  7 13:59:59 host1 ntpd[44765]: Listening on interface #3 em2,
> fe80::ca1f:66ff:fee1:eed#123 Enabled
> Jan  7 13:59:59 host1 ntpd[44765]: Listening on interface #4 lo,
> 127.0.0.1#123 Enabled
> Jan  7 13:59:59 host1 ntpd[44765]: Listening on interface #5 em2,
> 192.168.1.151#123 Enabled
> Jan  7 13:59:59 host1 ntpd[44765]: Listening on routing socket on fd #22
> for interface updates
> Jan  7 13:59:59 host1 ntpd[44765]: kernel time sync status 2040
> Jan  7 13:59:59 host1 ntpd[44765]: frequency initialized 499.399 PPM from
> /var/lib/ntp/drift
> Jan  7 14:00:01 host1 ntpd_initres[32103]: parent died before we finished,
> exiting
> Jan  7 14:04:17 host1 ntpd[44765]: synchronized to 192.168.0.191, stratum 2
> Jan  7 14:04:17 host1 ntpd[44765]: kernel time sync status change 2001
> Jan  7 14:26:02 host1 snmpd[4842]: Received TERM or STOP signal...
>  shutting down...
> Jan  7 14:26:02 host1 kernel: netlink: 12 bytes leftover after parsing
> attributes.
> Jan  7 14:26:02 host1 snmpd[45667]: NET-SNMP version 5.5
> Jan  7 14:52:48 host1 ntpd[44765]: no servers reachable
>
> It looks likely the command 'service ntpd stop' send the SIGTERM signal.
> The above clue 'ntpd[32101]' indicates that the ntpd process PID is 32101,
> inspect NameNode log, i found that the NameNode process PID was not
> identical with ntpd.
> So i wonder why Namenode process can received the signal?
>



-- 
Or Sher

Re: Re: Stopping ntpd signals SIGTERM, then causes namenode exit

Posted by Or Sher <or...@gmail.com>.

I'm not sure it's related but I encountered a similar issue a few months
ago.
In my case, it was an "at" command sending a kill signal to the at daemon
with it's correct pid.
Somehow, once in a while this signal got to Cassandra process (Java as
well) and killed it.
After some time of investigation I assumed this have to be a kernel bug or
something and I've opened a ticket for CentOS -
http://bugs.centos.org/view.php?id=7539 which no body is really looking at
:)
You can read there how I tried to tackle it.
Bottom line, we've changed the at scheduler to a different implementation
and we don't get this issue any more.

HTH,
Or.


On Wed, Feb 11, 2015 at 3:39 AM, David chen <c7...@163.com> wrote:

> The command 'service ntpd stop' could be triggered around 14:00.
> Because the crontab was set as follows:
> 0 * * * * sh sync.sh
> The script contains the following command:
> #!/bin/bash
> service ntpd stop
> ntpdate 192.168.0.1 #it's a valid ntpd server in LAN
> service ntpd start
> chkconfig ntpd on
>
> Found the following fragment in /var/log/message:
> Jan  7 14:00:01 host1 ntpd[32101]: ntpd exiting on signal 15
> Jan  7 13:59:59 host1 ntpd[44764]: ntpd 4.2.4p8@1.1612-o Fri Feb 22
> 11:23:27 UTC 2013 (1)
> Jan  7 13:59:59 host1 ntpd[44765]: precision = 0.143 usec
> Jan  7 13:59:59 host1 ntpd[44765]: Listening on interface #0 wildcard,
> 0.0.0.0#123 Disabled
> Jan  7 13:59:59 host1 ntpd[44765]: Listening on interface #1 wildcard,
> ::#123 Disabled
> Jan  7 13:59:59 host1 ntpd[44765]: Listening on interface #2 lo, ::1#123
> Enabled
> Jan  7 13:59:59 host1 ntpd[44765]: Listening on interface #3 em2,
> fe80::ca1f:66ff:fee1:eed#123 Enabled
> Jan  7 13:59:59 host1 ntpd[44765]: Listening on interface #4 lo,
> 127.0.0.1#123 Enabled
> Jan  7 13:59:59 host1 ntpd[44765]: Listening on interface #5 em2,
> 192.168.1.151#123 Enabled
> Jan  7 13:59:59 host1 ntpd[44765]: Listening on routing socket on fd #22
> for interface updates
> Jan  7 13:59:59 host1 ntpd[44765]: kernel time sync status 2040
> Jan  7 13:59:59 host1 ntpd[44765]: frequency initialized 499.399 PPM from
> /var/lib/ntp/drift
> Jan  7 14:00:01 host1 ntpd_initres[32103]: parent died before we finished,
> exiting
> Jan  7 14:04:17 host1 ntpd[44765]: synchronized to 192.168.0.191, stratum 2
> Jan  7 14:04:17 host1 ntpd[44765]: kernel time sync status change 2001
> Jan  7 14:26:02 host1 snmpd[4842]: Received TERM or STOP signal...
>  shutting down...
> Jan  7 14:26:02 host1 kernel: netlink: 12 bytes leftover after parsing
> attributes.
> Jan  7 14:26:02 host1 snmpd[45667]: NET-SNMP version 5.5
> Jan  7 14:52:48 host1 ntpd[44765]: no servers reachable
>
> It looks likely the command 'service ntpd stop' send the SIGTERM signal.
> The above clue 'ntpd[32101]' indicates that the ntpd process PID is 32101,
> inspect NameNode log, i found that the NameNode process PID was not
> identical with ntpd.
> So i wonder why Namenode process can received the signal?
>



-- 
Or Sher

Re:Re: Stopping ntpd signals SIGTERM, then causes namenode exit

Posted by David chen <c7...@163.com>.

The command 'service ntpd stop' could be triggered around 14:00.
Because the crontab was set as follows:
0 * * * * sh sync.sh
The script contains the following command:
#!/bin/bash
service ntpd stop
ntpdate 192.168.0.1 #it's a valid ntpd server in LAN
service ntpd start
chkconfig ntpd on


Found the following fragment in /var/log/message: 
Jan  7 14:00:01 host1 ntpd[32101]: ntpd exiting on signal 15
Jan  7 13:59:59 host1 ntpd[44764]: ntpd 4.2.4p8@1.1612-o Fri Feb 22 11:23:27 UTC 2013 (1)
Jan  7 13:59:59 host1 ntpd[44765]: precision = 0.143 usec
Jan  7 13:59:59 host1 ntpd[44765]: Listening on interface #0 wildcard, 0.0.0.0#123 Disabled
Jan  7 13:59:59 host1 ntpd[44765]: Listening on interface #1 wildcard, ::#123 Disabled
Jan  7 13:59:59 host1 ntpd[44765]: Listening on interface #2 lo, ::1#123 Enabled
Jan  7 13:59:59 host1 ntpd[44765]: Listening on interface #3 em2, fe80::ca1f:66ff:fee1:eed#123 Enabled
Jan  7 13:59:59 host1 ntpd[44765]: Listening on interface #4 lo, 127.0.0.1#123 Enabled
Jan  7 13:59:59 host1 ntpd[44765]: Listening on interface #5 em2, 192.168.1.151#123 Enabled
Jan  7 13:59:59 host1 ntpd[44765]: Listening on routing socket on fd #22 for interface updates
Jan  7 13:59:59 host1 ntpd[44765]: kernel time sync status 2040
Jan  7 13:59:59 host1 ntpd[44765]: frequency initialized 499.399 PPM from /var/lib/ntp/drift
Jan  7 14:00:01 host1 ntpd_initres[32103]: parent died before we finished, exiting
Jan  7 14:04:17 host1 ntpd[44765]: synchronized to 192.168.0.191, stratum 2
Jan  7 14:04:17 host1 ntpd[44765]: kernel time sync status change 2001
Jan  7 14:26:02 host1 snmpd[4842]: Received TERM or STOP signal...  shutting down...
Jan  7 14:26:02 host1 kernel: netlink: 12 bytes leftover after parsing attributes.
Jan  7 14:26:02 host1 snmpd[45667]: NET-SNMP version 5.5
Jan  7 14:52:48 host1 ntpd[44765]: no servers reachable


It looks likely the command 'service ntpd stop' send the SIGTERM signal. The above clue 'ntpd[32101]' indicates that the ntpd process PID is 32101, inspect NameNode log, i found that the NameNode process PID was not identical with ntpd. 
So i wonder why Namenode process can received the signal?

Re:Re: Stopping ntpd signals SIGTERM, then causes namenode exit

Posted by David chen <c7...@163.com>.

The command 'service ntpd stop' could be triggered around 14:00.
Because the crontab was set as follows:
0 * * * * sh sync.sh
The script contains the following command:
#!/bin/bash
service ntpd stop
ntpdate 192.168.0.1 #it's a valid ntpd server in LAN
service ntpd start
chkconfig ntpd on


Found the following fragment in /var/log/message: 
Jan  7 14:00:01 host1 ntpd[32101]: ntpd exiting on signal 15
Jan  7 13:59:59 host1 ntpd[44764]: ntpd 4.2.4p8@1.1612-o Fri Feb 22 11:23:27 UTC 2013 (1)
Jan  7 13:59:59 host1 ntpd[44765]: precision = 0.143 usec
Jan  7 13:59:59 host1 ntpd[44765]: Listening on interface #0 wildcard, 0.0.0.0#123 Disabled
Jan  7 13:59:59 host1 ntpd[44765]: Listening on interface #1 wildcard, ::#123 Disabled
Jan  7 13:59:59 host1 ntpd[44765]: Listening on interface #2 lo, ::1#123 Enabled
Jan  7 13:59:59 host1 ntpd[44765]: Listening on interface #3 em2, fe80::ca1f:66ff:fee1:eed#123 Enabled
Jan  7 13:59:59 host1 ntpd[44765]: Listening on interface #4 lo, 127.0.0.1#123 Enabled
Jan  7 13:59:59 host1 ntpd[44765]: Listening on interface #5 em2, 192.168.1.151#123 Enabled
Jan  7 13:59:59 host1 ntpd[44765]: Listening on routing socket on fd #22 for interface updates
Jan  7 13:59:59 host1 ntpd[44765]: kernel time sync status 2040
Jan  7 13:59:59 host1 ntpd[44765]: frequency initialized 499.399 PPM from /var/lib/ntp/drift
Jan  7 14:00:01 host1 ntpd_initres[32103]: parent died before we finished, exiting
Jan  7 14:04:17 host1 ntpd[44765]: synchronized to 192.168.0.191, stratum 2
Jan  7 14:04:17 host1 ntpd[44765]: kernel time sync status change 2001
Jan  7 14:26:02 host1 snmpd[4842]: Received TERM or STOP signal...  shutting down...
Jan  7 14:26:02 host1 kernel: netlink: 12 bytes leftover after parsing attributes.
Jan  7 14:26:02 host1 snmpd[45667]: NET-SNMP version 5.5
Jan  7 14:52:48 host1 ntpd[44765]: no servers reachable


It looks likely the command 'service ntpd stop' send the SIGTERM signal. The above clue 'ntpd[32101]' indicates that the ntpd process PID is 32101, inspect NameNode log, i found that the NameNode process PID was not identical with ntpd. 
So i wonder why Namenode process can received the signal?

Re:Re: Stopping ntpd signals SIGTERM, then causes namenode exit

Posted by David chen <c7...@163.com>.

The command 'service ntpd stop' could be triggered around 14:00.
Because the crontab was set as follows:
0 * * * * sh sync.sh
The script contains the following command:
#!/bin/bash
service ntpd stop
ntpdate 192.168.0.1 #it's a valid ntpd server in LAN
service ntpd start
chkconfig ntpd on


Found the following fragment in /var/log/message: 
Jan  7 14:00:01 host1 ntpd[32101]: ntpd exiting on signal 15
Jan  7 13:59:59 host1 ntpd[44764]: ntpd 4.2.4p8@1.1612-o Fri Feb 22 11:23:27 UTC 2013 (1)
Jan  7 13:59:59 host1 ntpd[44765]: precision = 0.143 usec
Jan  7 13:59:59 host1 ntpd[44765]: Listening on interface #0 wildcard, 0.0.0.0#123 Disabled
Jan  7 13:59:59 host1 ntpd[44765]: Listening on interface #1 wildcard, ::#123 Disabled
Jan  7 13:59:59 host1 ntpd[44765]: Listening on interface #2 lo, ::1#123 Enabled
Jan  7 13:59:59 host1 ntpd[44765]: Listening on interface #3 em2, fe80::ca1f:66ff:fee1:eed#123 Enabled
Jan  7 13:59:59 host1 ntpd[44765]: Listening on interface #4 lo, 127.0.0.1#123 Enabled
Jan  7 13:59:59 host1 ntpd[44765]: Listening on interface #5 em2, 192.168.1.151#123 Enabled
Jan  7 13:59:59 host1 ntpd[44765]: Listening on routing socket on fd #22 for interface updates
Jan  7 13:59:59 host1 ntpd[44765]: kernel time sync status 2040
Jan  7 13:59:59 host1 ntpd[44765]: frequency initialized 499.399 PPM from /var/lib/ntp/drift
Jan  7 14:00:01 host1 ntpd_initres[32103]: parent died before we finished, exiting
Jan  7 14:04:17 host1 ntpd[44765]: synchronized to 192.168.0.191, stratum 2
Jan  7 14:04:17 host1 ntpd[44765]: kernel time sync status change 2001
Jan  7 14:26:02 host1 snmpd[4842]: Received TERM or STOP signal...  shutting down...
Jan  7 14:26:02 host1 kernel: netlink: 12 bytes leftover after parsing attributes.
Jan  7 14:26:02 host1 snmpd[45667]: NET-SNMP version 5.5
Jan  7 14:52:48 host1 ntpd[44765]: no servers reachable


It looks likely the command 'service ntpd stop' send the SIGTERM signal. The above clue 'ntpd[32101]' indicates that the ntpd process PID is 32101, inspect NameNode log, i found that the NameNode process PID was not identical with ntpd. 
So i wonder why Namenode process can received the signal?

Re:Re: Stopping ntpd signals SIGTERM, then causes namenode exit

Posted by David chen <c7...@163.com>.

The command 'service ntpd stop' could be triggered around 14:00.
Because the crontab was set as follows:
0 * * * * sh sync.sh
The script contains the following command:
#!/bin/bash
service ntpd stop
ntpdate 192.168.0.1 #it's a valid ntpd server in LAN
service ntpd start
chkconfig ntpd on


Found the following fragment in /var/log/message: 
Jan  7 14:00:01 host1 ntpd[32101]: ntpd exiting on signal 15
Jan  7 13:59:59 host1 ntpd[44764]: ntpd 4.2.4p8@1.1612-o Fri Feb 22 11:23:27 UTC 2013 (1)
Jan  7 13:59:59 host1 ntpd[44765]: precision = 0.143 usec
Jan  7 13:59:59 host1 ntpd[44765]: Listening on interface #0 wildcard, 0.0.0.0#123 Disabled
Jan  7 13:59:59 host1 ntpd[44765]: Listening on interface #1 wildcard, ::#123 Disabled
Jan  7 13:59:59 host1 ntpd[44765]: Listening on interface #2 lo, ::1#123 Enabled
Jan  7 13:59:59 host1 ntpd[44765]: Listening on interface #3 em2, fe80::ca1f:66ff:fee1:eed#123 Enabled
Jan  7 13:59:59 host1 ntpd[44765]: Listening on interface #4 lo, 127.0.0.1#123 Enabled
Jan  7 13:59:59 host1 ntpd[44765]: Listening on interface #5 em2, 192.168.1.151#123 Enabled
Jan  7 13:59:59 host1 ntpd[44765]: Listening on routing socket on fd #22 for interface updates
Jan  7 13:59:59 host1 ntpd[44765]: kernel time sync status 2040
Jan  7 13:59:59 host1 ntpd[44765]: frequency initialized 499.399 PPM from /var/lib/ntp/drift
Jan  7 14:00:01 host1 ntpd_initres[32103]: parent died before we finished, exiting
Jan  7 14:04:17 host1 ntpd[44765]: synchronized to 192.168.0.191, stratum 2
Jan  7 14:04:17 host1 ntpd[44765]: kernel time sync status change 2001
Jan  7 14:26:02 host1 snmpd[4842]: Received TERM or STOP signal...  shutting down...
Jan  7 14:26:02 host1 kernel: netlink: 12 bytes leftover after parsing attributes.
Jan  7 14:26:02 host1 snmpd[45667]: NET-SNMP version 5.5
Jan  7 14:52:48 host1 ntpd[44765]: no servers reachable


It looks likely the command 'service ntpd stop' send the SIGTERM signal. The above clue 'ntpd[32101]' indicates that the ntpd process PID is 32101, inspect NameNode log, i found that the NameNode process PID was not identical with ntpd. 
So i wonder why Namenode process can received the signal?

Re: Stopping ntpd signals SIGTERM, then causes namenode exit

Posted by Alexander Alten-Lorenz <wg...@gmail.com>.

The sigterm should be noted in the NN logs, whats happen around 14:00? 

BR,
 Alexander 

> On 10 Feb 2015, at 02:39, David chen <c7...@163.com> wrote:
> 
> Thanks for your reply.
> I ever indeed misunderstand how ntpd works, so deployed the shell script on every node to synchronize time. The reason why the script contains stopping ntpd command is that the following command 'ntpdate 192.168.0.1' will fail if miss it.
> Now i have understood the difference between ntp and ntpdate, so i discarded the script, then configured the ntp.conf to synchronize time.
> The reason to post the thread is that i wonder why NameNode receive the SIGTERM signal sent by stopping ntpd command? and why three times all happened at 14:00:00?

Re: Stopping ntpd signals SIGTERM, then causes namenode exit

Posted by Alexander Alten-Lorenz <wg...@gmail.com>.

The sigterm should be noted in the NN logs, whats happen around 14:00? 

BR,
 Alexander 

> On 10 Feb 2015, at 02:39, David chen <c7...@163.com> wrote:
> 
> Thanks for your reply.
> I ever indeed misunderstand how ntpd works, so deployed the shell script on every node to synchronize time. The reason why the script contains stopping ntpd command is that the following command 'ntpdate 192.168.0.1' will fail if miss it.
> Now i have understood the difference between ntp and ntpdate, so i discarded the script, then configured the ntp.conf to synchronize time.
> The reason to post the thread is that i wonder why NameNode receive the SIGTERM signal sent by stopping ntpd command? and why three times all happened at 14:00:00?

Re: Stopping ntpd signals SIGTERM, then causes namenode exit

Posted by Alexander Alten-Lorenz <wg...@gmail.com>.

The sigterm should be noted in the NN logs, whats happen around 14:00? 

BR,
 Alexander 

> On 10 Feb 2015, at 02:39, David chen <c7...@163.com> wrote:
> 
> Thanks for your reply.
> I ever indeed misunderstand how ntpd works, so deployed the shell script on every node to synchronize time. The reason why the script contains stopping ntpd command is that the following command 'ntpdate 192.168.0.1' will fail if miss it.
> Now i have understood the difference between ntp and ntpdate, so i discarded the script, then configured the ntp.conf to synchronize time.
> The reason to post the thread is that i wonder why NameNode receive the SIGTERM signal sent by stopping ntpd command? and why three times all happened at 14:00:00?

Re: Stopping ntpd signals SIGTERM, then causes namenode exit

Posted by Alexander Alten-Lorenz <wg...@gmail.com>.

The sigterm should be noted in the NN logs, whats happen around 14:00? 

BR,
 Alexander 

> On 10 Feb 2015, at 02:39, David chen <c7...@163.com> wrote:
> 
> Thanks for your reply.
> I ever indeed misunderstand how ntpd works, so deployed the shell script on every node to synchronize time. The reason why the script contains stopping ntpd command is that the following command 'ntpdate 192.168.0.1' will fail if miss it.
> Now i have understood the difference between ntp and ntpdate, so i discarded the script, then configured the ntp.conf to synchronize time.
> The reason to post the thread is that i wonder why NameNode receive the SIGTERM signal sent by stopping ntpd command? and why three times all happened at 14:00:00?

Re:Re: Stopping ntpd signals SIGTERM, then causes namenode exit

Posted by David chen <c7...@163.com>.

Thanks for your reply.
I ever indeed misunderstand how ntpd works, so deployed the shell script on every node to synchronize time. The reason why the script contains stopping ntpd command is that the following command 'ntpdate 192.168.0.1' will fail if miss it.
Now i have understood the difference between ntp and ntpdate, so i discarded the script, then configured the ntp.conf to synchronize time.
The reason to post the thread is that i wonder why NameNode receive the SIGTERM signal sent by stopping ntpd command? and why three times all happened at 14:00:00?

Re:Re: Stopping ntpd signals SIGTERM, then causes namenode exit

Posted by David chen <c7...@163.com>.

Thanks for your reply.
I ever indeed misunderstand how ntpd works, so deployed the shell script on every node to synchronize time. The reason why the script contains stopping ntpd command is that the following command 'ntpdate 192.168.0.1' will fail if miss it.
Now i have understood the difference between ntp and ntpdate, so i discarded the script, then configured the ntp.conf to synchronize time.
The reason to post the thread is that i wonder why NameNode receive the SIGTERM signal sent by stopping ntpd command? and why three times all happened at 14:00:00?

Re:Re: Stopping ntpd signals SIGTERM, then causes namenode exit

Posted by David chen <c7...@163.com>.

Thanks for your reply.
I ever indeed misunderstand how ntpd works, so deployed the shell script on every node to synchronize time. The reason why the script contains stopping ntpd command is that the following command 'ntpdate 192.168.0.1' will fail if miss it.
Now i have understood the difference between ntp and ntpdate, so i discarded the script, then configured the ntp.conf to synchronize time.
The reason to post the thread is that i wonder why NameNode receive the SIGTERM signal sent by stopping ntpd command? and why three times all happened at 14:00:00?

Re:Re: Stopping ntpd signals SIGTERM, then causes namenode exit

Posted by David chen <c7...@163.com>.

Thanks for your reply.
I ever indeed misunderstand how ntpd works, so deployed the shell script on every node to synchronize time. The reason why the script contains stopping ntpd command is that the following command 'ntpdate 192.168.0.1' will fail if miss it.
Now i have understood the difference between ntp and ntpdate, so i discarded the script, then configured the ntp.conf to synchronize time.
The reason to post the thread is that i wonder why NameNode receive the SIGTERM signal sent by stopping ntpd command? and why three times all happened at 14:00:00?

Re: Stopping ntpd signals SIGTERM, then causes namenode exit

Posted by Alexander Alten-Lorenz <wg...@gmail.com>.

I would spot on 

Jan  7 14:52:48 host1 ntpd[44765]: no servers reachable

looks for me like an network / DNS issue. You can check per dmesg whats going on, too.

BR
- Alexander

> On 09 Feb 2015, at 17:57, daemeon reiydelle <da...@gmail.com> wrote:
> 
> Absolutely a critical error to lose the configured ntpd time source in Hadoop. The replication and many other services require absolutely millisecond time sync between the nodes. Interesting that your SRE design called for ntpd running on each node. Curious.
> 
> What is the problem you are trying to solve by stopping ntpd on the local host? Did someone not understand how ntpd works? Did someone configure it to (I sure hope not) be free running?
> 
> 
> 
> .......
> “Life should not be a journey to the grave with the intention of arriving safely in a
> pretty and well preserved body, but rather to skid in broadside in a cloud of smoke,
> thoroughly used up, totally worn out, and loudly proclaiming “Wow! What a Ride!” 
> - Hunter Thompson
> 
> Daemeon C.M. Reiydelle
> USA (+1) 415.501.0198
> London (+44) (0) 20 8144 9872
> 
> On Sun, Feb 8, 2015 at 7:30 PM, David chen <c77_cn@163.com <ma...@163.com>> wrote:
> A shell script is deployed on every node of HDFS cluster, the script is invoked hourly by crontab, and its content is as follows:
> #!/bin/bash
> service ntpd stop
> ntpdate 192.168.0.1 #it's a valid ntpd server in LAN
> service ntpd start
> chkconfig ntpd on
> 
> After several days, NameNode crashed suddenly, but its log seemed no other errors except the following:
> 2015-01-07 14:00:00,709 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: RECEIVED SIGNAL 15: SIGTERM
> 
> Inspected the Linux log(Centos /var/log/messages), also found the following clues:
> Jan  7 14:00:01 host1 ntpd[32101]: ntpd exiting on signal 15
> Jan  7 13:59:59 host1 ntpd[44764]: ntpd 4.2.4p8@1.1612-o Fri Feb 22 11:23:27 UTC 2013 (1)
> Jan  7 13:59:59 host1 ntpd[44765]: precision = 0.143 usec
> Jan  7 13:59:59 host1 ntpd[44765]: Listening on interface #0 wildcard, 0.0.0.0#123 Disabled
> Jan  7 13:59:59 host1 ntpd[44765]: Listening on interface #1 wildcard, ::#123 Disabled
> Jan  7 13:59:59 host1 ntpd[44765]: Listening on interface #2 lo, ::1#123 Enabled
> Jan  7 13:59:59 host1 ntpd[44765]: Listening on interface #3 em2, fe80::ca1f:66ff:fee1:eed#123 Enabled
> Jan  7 13:59:59 host1 ntpd[44765]: Listening on interface #4 lo, 127.0.0.1#123 Enabled
> Jan  7 13:59:59 host1 ntpd[44765]: Listening on interface #5 em2, 192.168.1.151#123 Enabled
> Jan  7 13:59:59 host1 ntpd[44765]: Listening on routing socket on fd #22 for interface updates
> Jan  7 13:59:59 host1 ntpd[44765]: kernel time sync status 2040
> Jan  7 13:59:59 host1 ntpd[44765]: frequency initialized 499.399 PPM from /var/lib/ntp/drift
> Jan  7 14:00:01 host1 ntpd_initres[32103]: parent died before we finished, exiting
> Jan  7 14:04:17 host1 ntpd[44765]: synchronized to 192.168.0.191, stratum 2
> Jan  7 14:04:17 host1 ntpd[44765]: kernel time sync status change 2001
> Jan  7 14:26:02 host1 snmpd[4842]: Received TERM or STOP signal...  shutting down...
> Jan  7 14:26:02 host1 kernel: netlink: 12 bytes leftover after parsing attributes.
> Jan  7 14:26:02 host1 snmpd[45667]: NET-SNMP version 5.5
> Jan  7 14:52:48 host1 ntpd[44765]: no servers reachable
> 
> It looks likely that NameNode received the SIGTERM signal sent by stopping ntpd command.
> Up to now, the problem has happened three times repeatedly, the time point was Jan  7 14:00:00, Jan 14 14:00:00 and Feb  4 14:00:00 respectively.
> Although the script to synchronize time is a little improper, and i also know the correct synchronized way. but i wonder why NameNode can receive the SIGTERM signal sent by stopping ntpd command? and why three times all happened at 14:00:00?
> Any ideas can be appreciated.
>

Re: Stopping ntpd signals SIGTERM, then causes namenode exit

Posted by Alexander Alten-Lorenz <wg...@gmail.com>.

I would spot on 

Jan  7 14:52:48 host1 ntpd[44765]: no servers reachable

looks for me like an network / DNS issue. You can check per dmesg whats going on, too.

BR
- Alexander

> On 09 Feb 2015, at 17:57, daemeon reiydelle <da...@gmail.com> wrote:
> 
> Absolutely a critical error to lose the configured ntpd time source in Hadoop. The replication and many other services require absolutely millisecond time sync between the nodes. Interesting that your SRE design called for ntpd running on each node. Curious.
> 
> What is the problem you are trying to solve by stopping ntpd on the local host? Did someone not understand how ntpd works? Did someone configure it to (I sure hope not) be free running?
> 
> 
> 
> .......
> “Life should not be a journey to the grave with the intention of arriving safely in a
> pretty and well preserved body, but rather to skid in broadside in a cloud of smoke,
> thoroughly used up, totally worn out, and loudly proclaiming “Wow! What a Ride!” 
> - Hunter Thompson
> 
> Daemeon C.M. Reiydelle
> USA (+1) 415.501.0198
> London (+44) (0) 20 8144 9872
> 
> On Sun, Feb 8, 2015 at 7:30 PM, David chen <c77_cn@163.com <ma...@163.com>> wrote:
> A shell script is deployed on every node of HDFS cluster, the script is invoked hourly by crontab, and its content is as follows:
> #!/bin/bash
> service ntpd stop
> ntpdate 192.168.0.1 #it's a valid ntpd server in LAN
> service ntpd start
> chkconfig ntpd on
> 
> After several days, NameNode crashed suddenly, but its log seemed no other errors except the following:
> 2015-01-07 14:00:00,709 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: RECEIVED SIGNAL 15: SIGTERM
> 
> Inspected the Linux log(Centos /var/log/messages), also found the following clues:
> Jan  7 14:00:01 host1 ntpd[32101]: ntpd exiting on signal 15
> Jan  7 13:59:59 host1 ntpd[44764]: ntpd 4.2.4p8@1.1612-o Fri Feb 22 11:23:27 UTC 2013 (1)
> Jan  7 13:59:59 host1 ntpd[44765]: precision = 0.143 usec
> Jan  7 13:59:59 host1 ntpd[44765]: Listening on interface #0 wildcard, 0.0.0.0#123 Disabled
> Jan  7 13:59:59 host1 ntpd[44765]: Listening on interface #1 wildcard, ::#123 Disabled
> Jan  7 13:59:59 host1 ntpd[44765]: Listening on interface #2 lo, ::1#123 Enabled
> Jan  7 13:59:59 host1 ntpd[44765]: Listening on interface #3 em2, fe80::ca1f:66ff:fee1:eed#123 Enabled
> Jan  7 13:59:59 host1 ntpd[44765]: Listening on interface #4 lo, 127.0.0.1#123 Enabled
> Jan  7 13:59:59 host1 ntpd[44765]: Listening on interface #5 em2, 192.168.1.151#123 Enabled
> Jan  7 13:59:59 host1 ntpd[44765]: Listening on routing socket on fd #22 for interface updates
> Jan  7 13:59:59 host1 ntpd[44765]: kernel time sync status 2040
> Jan  7 13:59:59 host1 ntpd[44765]: frequency initialized 499.399 PPM from /var/lib/ntp/drift
> Jan  7 14:00:01 host1 ntpd_initres[32103]: parent died before we finished, exiting
> Jan  7 14:04:17 host1 ntpd[44765]: synchronized to 192.168.0.191, stratum 2
> Jan  7 14:04:17 host1 ntpd[44765]: kernel time sync status change 2001
> Jan  7 14:26:02 host1 snmpd[4842]: Received TERM or STOP signal...  shutting down...
> Jan  7 14:26:02 host1 kernel: netlink: 12 bytes leftover after parsing attributes.
> Jan  7 14:26:02 host1 snmpd[45667]: NET-SNMP version 5.5
> Jan  7 14:52:48 host1 ntpd[44765]: no servers reachable
> 
> It looks likely that NameNode received the SIGTERM signal sent by stopping ntpd command.
> Up to now, the problem has happened three times repeatedly, the time point was Jan  7 14:00:00, Jan 14 14:00:00 and Feb  4 14:00:00 respectively.
> Although the script to synchronize time is a little improper, and i also know the correct synchronized way. but i wonder why NameNode can receive the SIGTERM signal sent by stopping ntpd command? and why three times all happened at 14:00:00?
> Any ideas can be appreciated.
>

Re: Stopping ntpd signals SIGTERM, then causes namenode exit

Posted by Alexander Alten-Lorenz <wg...@gmail.com>.

I would spot on 

Jan  7 14:52:48 host1 ntpd[44765]: no servers reachable

looks for me like an network / DNS issue. You can check per dmesg whats going on, too.

BR
- Alexander

> On 09 Feb 2015, at 17:57, daemeon reiydelle <da...@gmail.com> wrote:
> 
> Absolutely a critical error to lose the configured ntpd time source in Hadoop. The replication and many other services require absolutely millisecond time sync between the nodes. Interesting that your SRE design called for ntpd running on each node. Curious.
> 
> What is the problem you are trying to solve by stopping ntpd on the local host? Did someone not understand how ntpd works? Did someone configure it to (I sure hope not) be free running?
> 
> 
> 
> .......
> “Life should not be a journey to the grave with the intention of arriving safely in a
> pretty and well preserved body, but rather to skid in broadside in a cloud of smoke,
> thoroughly used up, totally worn out, and loudly proclaiming “Wow! What a Ride!” 
> - Hunter Thompson
> 
> Daemeon C.M. Reiydelle
> USA (+1) 415.501.0198
> London (+44) (0) 20 8144 9872
> 
> On Sun, Feb 8, 2015 at 7:30 PM, David chen <c77_cn@163.com <ma...@163.com>> wrote:
> A shell script is deployed on every node of HDFS cluster, the script is invoked hourly by crontab, and its content is as follows:
> #!/bin/bash
> service ntpd stop
> ntpdate 192.168.0.1 #it's a valid ntpd server in LAN
> service ntpd start
> chkconfig ntpd on
> 
> After several days, NameNode crashed suddenly, but its log seemed no other errors except the following:
> 2015-01-07 14:00:00,709 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: RECEIVED SIGNAL 15: SIGTERM
> 
> Inspected the Linux log(Centos /var/log/messages), also found the following clues:
> Jan  7 14:00:01 host1 ntpd[32101]: ntpd exiting on signal 15
> Jan  7 13:59:59 host1 ntpd[44764]: ntpd 4.2.4p8@1.1612-o Fri Feb 22 11:23:27 UTC 2013 (1)
> Jan  7 13:59:59 host1 ntpd[44765]: precision = 0.143 usec
> Jan  7 13:59:59 host1 ntpd[44765]: Listening on interface #0 wildcard, 0.0.0.0#123 Disabled
> Jan  7 13:59:59 host1 ntpd[44765]: Listening on interface #1 wildcard, ::#123 Disabled
> Jan  7 13:59:59 host1 ntpd[44765]: Listening on interface #2 lo, ::1#123 Enabled
> Jan  7 13:59:59 host1 ntpd[44765]: Listening on interface #3 em2, fe80::ca1f:66ff:fee1:eed#123 Enabled
> Jan  7 13:59:59 host1 ntpd[44765]: Listening on interface #4 lo, 127.0.0.1#123 Enabled
> Jan  7 13:59:59 host1 ntpd[44765]: Listening on interface #5 em2, 192.168.1.151#123 Enabled
> Jan  7 13:59:59 host1 ntpd[44765]: Listening on routing socket on fd #22 for interface updates
> Jan  7 13:59:59 host1 ntpd[44765]: kernel time sync status 2040
> Jan  7 13:59:59 host1 ntpd[44765]: frequency initialized 499.399 PPM from /var/lib/ntp/drift
> Jan  7 14:00:01 host1 ntpd_initres[32103]: parent died before we finished, exiting
> Jan  7 14:04:17 host1 ntpd[44765]: synchronized to 192.168.0.191, stratum 2
> Jan  7 14:04:17 host1 ntpd[44765]: kernel time sync status change 2001
> Jan  7 14:26:02 host1 snmpd[4842]: Received TERM or STOP signal...  shutting down...
> Jan  7 14:26:02 host1 kernel: netlink: 12 bytes leftover after parsing attributes.
> Jan  7 14:26:02 host1 snmpd[45667]: NET-SNMP version 5.5
> Jan  7 14:52:48 host1 ntpd[44765]: no servers reachable
> 
> It looks likely that NameNode received the SIGTERM signal sent by stopping ntpd command.
> Up to now, the problem has happened three times repeatedly, the time point was Jan  7 14:00:00, Jan 14 14:00:00 and Feb  4 14:00:00 respectively.
> Although the script to synchronize time is a little improper, and i also know the correct synchronized way. but i wonder why NameNode can receive the SIGTERM signal sent by stopping ntpd command? and why three times all happened at 14:00:00?
> Any ideas can be appreciated.
>

Re: Stopping ntpd signals SIGTERM, then causes namenode exit

Posted by Alexander Alten-Lorenz <wg...@gmail.com>.

I would spot on 

Jan  7 14:52:48 host1 ntpd[44765]: no servers reachable

looks for me like an network / DNS issue. You can check per dmesg whats going on, too.

BR
- Alexander

> On 09 Feb 2015, at 17:57, daemeon reiydelle <da...@gmail.com> wrote:
> 
> Absolutely a critical error to lose the configured ntpd time source in Hadoop. The replication and many other services require absolutely millisecond time sync between the nodes. Interesting that your SRE design called for ntpd running on each node. Curious.
> 
> What is the problem you are trying to solve by stopping ntpd on the local host? Did someone not understand how ntpd works? Did someone configure it to (I sure hope not) be free running?
> 
> 
> 
> .......
> “Life should not be a journey to the grave with the intention of arriving safely in a
> pretty and well preserved body, but rather to skid in broadside in a cloud of smoke,
> thoroughly used up, totally worn out, and loudly proclaiming “Wow! What a Ride!” 
> - Hunter Thompson
> 
> Daemeon C.M. Reiydelle
> USA (+1) 415.501.0198
> London (+44) (0) 20 8144 9872
> 
> On Sun, Feb 8, 2015 at 7:30 PM, David chen <c77_cn@163.com <ma...@163.com>> wrote:
> A shell script is deployed on every node of HDFS cluster, the script is invoked hourly by crontab, and its content is as follows:
> #!/bin/bash
> service ntpd stop
> ntpdate 192.168.0.1 #it's a valid ntpd server in LAN
> service ntpd start
> chkconfig ntpd on
> 
> After several days, NameNode crashed suddenly, but its log seemed no other errors except the following:
> 2015-01-07 14:00:00,709 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: RECEIVED SIGNAL 15: SIGTERM
> 
> Inspected the Linux log(Centos /var/log/messages), also found the following clues:
> Jan  7 14:00:01 host1 ntpd[32101]: ntpd exiting on signal 15
> Jan  7 13:59:59 host1 ntpd[44764]: ntpd 4.2.4p8@1.1612-o Fri Feb 22 11:23:27 UTC 2013 (1)
> Jan  7 13:59:59 host1 ntpd[44765]: precision = 0.143 usec
> Jan  7 13:59:59 host1 ntpd[44765]: Listening on interface #0 wildcard, 0.0.0.0#123 Disabled
> Jan  7 13:59:59 host1 ntpd[44765]: Listening on interface #1 wildcard, ::#123 Disabled
> Jan  7 13:59:59 host1 ntpd[44765]: Listening on interface #2 lo, ::1#123 Enabled
> Jan  7 13:59:59 host1 ntpd[44765]: Listening on interface #3 em2, fe80::ca1f:66ff:fee1:eed#123 Enabled
> Jan  7 13:59:59 host1 ntpd[44765]: Listening on interface #4 lo, 127.0.0.1#123 Enabled
> Jan  7 13:59:59 host1 ntpd[44765]: Listening on interface #5 em2, 192.168.1.151#123 Enabled
> Jan  7 13:59:59 host1 ntpd[44765]: Listening on routing socket on fd #22 for interface updates
> Jan  7 13:59:59 host1 ntpd[44765]: kernel time sync status 2040
> Jan  7 13:59:59 host1 ntpd[44765]: frequency initialized 499.399 PPM from /var/lib/ntp/drift
> Jan  7 14:00:01 host1 ntpd_initres[32103]: parent died before we finished, exiting
> Jan  7 14:04:17 host1 ntpd[44765]: synchronized to 192.168.0.191, stratum 2
> Jan  7 14:04:17 host1 ntpd[44765]: kernel time sync status change 2001
> Jan  7 14:26:02 host1 snmpd[4842]: Received TERM or STOP signal...  shutting down...
> Jan  7 14:26:02 host1 kernel: netlink: 12 bytes leftover after parsing attributes.
> Jan  7 14:26:02 host1 snmpd[45667]: NET-SNMP version 5.5
> Jan  7 14:52:48 host1 ntpd[44765]: no servers reachable
> 
> It looks likely that NameNode received the SIGTERM signal sent by stopping ntpd command.
> Up to now, the problem has happened three times repeatedly, the time point was Jan  7 14:00:00, Jan 14 14:00:00 and Feb  4 14:00:00 respectively.
> Although the script to synchronize time is a little improper, and i also know the correct synchronized way. but i wonder why NameNode can receive the SIGTERM signal sent by stopping ntpd command? and why three times all happened at 14:00:00?
> Any ideas can be appreciated.
>

Re: Stopping ntpd signals SIGTERM, then causes namenode exit

Posted by daemeon reiydelle <da...@gmail.com>.

Absolutely a critical error to lose the configured ntpd time source in
Hadoop. The replication and many other services require absolutely
millisecond time sync between the nodes. Interesting that your SRE design
called for ntpd running on each node. Curious.

What is the problem you are trying to solve by stopping ntpd on the local
host? Did someone not understand how ntpd works? Did someone configure it
to (I sure hope not) be free running?




*.......*






*“Life should not be a journey to the grave with the intention of arriving
safely in apretty and well preserved body, but rather to skid in broadside
in a cloud of smoke,thoroughly used up, totally worn out, and loudly
proclaiming “Wow! What a Ride!” - Hunter ThompsonDaemeon C.M. ReiydelleUSA
(+1) 415.501.0198London (+44) (0) 20 8144 9872*

On Sun, Feb 8, 2015 at 7:30 PM, David chen <c7...@163.com> wrote:

> A shell script is deployed on every node of HDFS cluster, the script is
> invoked hourly by crontab, and its content is as follows:
> #!/bin/bash
> service ntpd stop
> ntpdate 192.168.0.1 #it's a valid ntpd server in LAN
> service ntpd start
> chkconfig ntpd on
>
> After several days, NameNode crashed suddenly, but its log seemed no other
> errors except the following:
> 2015-01-07 14:00:00,709 ERROR
> org.apache.hadoop.hdfs.server.namenode.NameNode: RECEIVED SIGNAL 15: SIGTERM
>
> Inspected the Linux log(Centos /var/log/messages), also found the
> following clues:
> Jan  7 14:00:01 host1 ntpd[32101]: ntpd exiting on signal 15
> Jan  7 13:59:59 host1 ntpd[44764]: ntpd 4.2.4p8@1.1612-o Fri Feb 22
> 11:23:27 UTC 2013 (1)
> Jan  7 13:59:59 host1 ntpd[44765]: precision = 0.143 usec
> Jan  7 13:59:59 host1 ntpd[44765]: Listening on interface #0 wildcard,
> 0.0.0.0#123 Disabled
> Jan  7 13:59:59 host1 ntpd[44765]: Listening on interface #1 wildcard,
> ::#123 Disabled
> Jan  7 13:59:59 host1 ntpd[44765]: Listening on interface #2 lo, ::1#123
> Enabled
> Jan  7 13:59:59 host1 ntpd[44765]: Listening on interface #3 em2,
> fe80::ca1f:66ff:fee1:eed#123 Enabled
> Jan  7 13:59:59 host1 ntpd[44765]: Listening on interface #4 lo,
> 127.0.0.1#123 Enabled
> Jan  7 13:59:59 host1 ntpd[44765]: Listening on interface #5 em2,
> 192.168.1.151#123 Enabled
> Jan  7 13:59:59 host1 ntpd[44765]: Listening on routing socket on fd #22
> for interface updates
> Jan  7 13:59:59 host1 ntpd[44765]: kernel time sync status 2040
> Jan  7 13:59:59 host1 ntpd[44765]: frequency initialized 499.399 PPM from
> /var/lib/ntp/drift
> Jan  7 14:00:01 host1 ntpd_initres[32103]: parent died before we finished,
> exiting
> Jan  7 14:04:17 host1 ntpd[44765]: synchronized to 192.168.0.191, stratum 2
> Jan  7 14:04:17 host1 ntpd[44765]: kernel time sync status change 2001
> Jan  7 14:26:02 host1 snmpd[4842]: Received TERM or STOP signal...
>  shutting down...
> Jan  7 14:26:02 host1 kernel: netlink: 12 bytes leftover after parsing
> attributes.
> Jan  7 14:26:02 host1 snmpd[45667]: NET-SNMP version 5.5
> Jan  7 14:52:48 host1 ntpd[44765]: no servers reachable
>
> It looks likely that NameNode received the SIGTERM signal sent by
> stopping ntpd command.
> Up to now, the problem has happened three times repeatedly, the time point
> was Jan  7 14:00:00, Jan 14 14:00:00 and Feb  4 14:00:00 respectively.
> Although the script to synchronize time is a little improper, and i also
> know the correct synchronized way. but i wonder why NameNode can receive
> the SIGTERM signal sent by stopping ntpd command? and why three times all
> happened at 14:00:00?
> Any ideas can be appreciated.
>

Re: Stopping ntpd signals SIGTERM, then causes namenode exit

Posted by daemeon reiydelle <da...@gmail.com>.

Absolutely a critical error to lose the configured ntpd time source in
Hadoop. The replication and many other services require absolutely
millisecond time sync between the nodes. Interesting that your SRE design
called for ntpd running on each node. Curious.

What is the problem you are trying to solve by stopping ntpd on the local
host? Did someone not understand how ntpd works? Did someone configure it
to (I sure hope not) be free running?




*.......*






*“Life should not be a journey to the grave with the intention of arriving
safely in apretty and well preserved body, but rather to skid in broadside
in a cloud of smoke,thoroughly used up, totally worn out, and loudly
proclaiming “Wow! What a Ride!” - Hunter ThompsonDaemeon C.M. ReiydelleUSA
(+1) 415.501.0198London (+44) (0) 20 8144 9872*

On Sun, Feb 8, 2015 at 7:30 PM, David chen <c7...@163.com> wrote:

> A shell script is deployed on every node of HDFS cluster, the script is
> invoked hourly by crontab, and its content is as follows:
> #!/bin/bash
> service ntpd stop
> ntpdate 192.168.0.1 #it's a valid ntpd server in LAN
> service ntpd start
> chkconfig ntpd on
>
> After several days, NameNode crashed suddenly, but its log seemed no other
> errors except the following:
> 2015-01-07 14:00:00,709 ERROR
> org.apache.hadoop.hdfs.server.namenode.NameNode: RECEIVED SIGNAL 15: SIGTERM
>
> Inspected the Linux log(Centos /var/log/messages), also found the
> following clues:
> Jan  7 14:00:01 host1 ntpd[32101]: ntpd exiting on signal 15
> Jan  7 13:59:59 host1 ntpd[44764]: ntpd 4.2.4p8@1.1612-o Fri Feb 22
> 11:23:27 UTC 2013 (1)
> Jan  7 13:59:59 host1 ntpd[44765]: precision = 0.143 usec
> Jan  7 13:59:59 host1 ntpd[44765]: Listening on interface #0 wildcard,
> 0.0.0.0#123 Disabled
> Jan  7 13:59:59 host1 ntpd[44765]: Listening on interface #1 wildcard,
> ::#123 Disabled
> Jan  7 13:59:59 host1 ntpd[44765]: Listening on interface #2 lo, ::1#123
> Enabled
> Jan  7 13:59:59 host1 ntpd[44765]: Listening on interface #3 em2,
> fe80::ca1f:66ff:fee1:eed#123 Enabled
> Jan  7 13:59:59 host1 ntpd[44765]: Listening on interface #4 lo,
> 127.0.0.1#123 Enabled
> Jan  7 13:59:59 host1 ntpd[44765]: Listening on interface #5 em2,
> 192.168.1.151#123 Enabled
> Jan  7 13:59:59 host1 ntpd[44765]: Listening on routing socket on fd #22
> for interface updates
> Jan  7 13:59:59 host1 ntpd[44765]: kernel time sync status 2040
> Jan  7 13:59:59 host1 ntpd[44765]: frequency initialized 499.399 PPM from
> /var/lib/ntp/drift
> Jan  7 14:00:01 host1 ntpd_initres[32103]: parent died before we finished,
> exiting
> Jan  7 14:04:17 host1 ntpd[44765]: synchronized to 192.168.0.191, stratum 2
> Jan  7 14:04:17 host1 ntpd[44765]: kernel time sync status change 2001
> Jan  7 14:26:02 host1 snmpd[4842]: Received TERM or STOP signal...
>  shutting down...
> Jan  7 14:26:02 host1 kernel: netlink: 12 bytes leftover after parsing
> attributes.
> Jan  7 14:26:02 host1 snmpd[45667]: NET-SNMP version 5.5
> Jan  7 14:52:48 host1 ntpd[44765]: no servers reachable
>
> It looks likely that NameNode received the SIGTERM signal sent by
> stopping ntpd command.
> Up to now, the problem has happened three times repeatedly, the time point
> was Jan  7 14:00:00, Jan 14 14:00:00 and Feb  4 14:00:00 respectively.
> Although the script to synchronize time is a little improper, and i also
> know the correct synchronized way. but i wonder why NameNode can receive
> the SIGTERM signal sent by stopping ntpd command? and why three times all
> happened at 14:00:00?
> Any ideas can be appreciated.
>

Re: Stopping ntpd signals SIGTERM, then causes namenode exit

Posted by daemeon reiydelle <da...@gmail.com>.

Absolutely a critical error to lose the configured ntpd time source in
Hadoop. The replication and many other services require absolutely
millisecond time sync between the nodes. Interesting that your SRE design
called for ntpd running on each node. Curious.

What is the problem you are trying to solve by stopping ntpd on the local
host? Did someone not understand how ntpd works? Did someone configure it
to (I sure hope not) be free running?




*.......*






*“Life should not be a journey to the grave with the intention of arriving
safely in apretty and well preserved body, but rather to skid in broadside
in a cloud of smoke,thoroughly used up, totally worn out, and loudly
proclaiming “Wow! What a Ride!” - Hunter ThompsonDaemeon C.M. ReiydelleUSA
(+1) 415.501.0198London (+44) (0) 20 8144 9872*

On Sun, Feb 8, 2015 at 7:30 PM, David chen <c7...@163.com> wrote:

> A shell script is deployed on every node of HDFS cluster, the script is
> invoked hourly by crontab, and its content is as follows:
> #!/bin/bash
> service ntpd stop
> ntpdate 192.168.0.1 #it's a valid ntpd server in LAN
> service ntpd start
> chkconfig ntpd on
>
> After several days, NameNode crashed suddenly, but its log seemed no other
> errors except the following:
> 2015-01-07 14:00:00,709 ERROR
> org.apache.hadoop.hdfs.server.namenode.NameNode: RECEIVED SIGNAL 15: SIGTERM
>
> Inspected the Linux log(Centos /var/log/messages), also found the
> following clues:
> Jan  7 14:00:01 host1 ntpd[32101]: ntpd exiting on signal 15
> Jan  7 13:59:59 host1 ntpd[44764]: ntpd 4.2.4p8@1.1612-o Fri Feb 22
> 11:23:27 UTC 2013 (1)
> Jan  7 13:59:59 host1 ntpd[44765]: precision = 0.143 usec
> Jan  7 13:59:59 host1 ntpd[44765]: Listening on interface #0 wildcard,
> 0.0.0.0#123 Disabled
> Jan  7 13:59:59 host1 ntpd[44765]: Listening on interface #1 wildcard,
> ::#123 Disabled
> Jan  7 13:59:59 host1 ntpd[44765]: Listening on interface #2 lo, ::1#123
> Enabled
> Jan  7 13:59:59 host1 ntpd[44765]: Listening on interface #3 em2,
> fe80::ca1f:66ff:fee1:eed#123 Enabled
> Jan  7 13:59:59 host1 ntpd[44765]: Listening on interface #4 lo,
> 127.0.0.1#123 Enabled
> Jan  7 13:59:59 host1 ntpd[44765]: Listening on interface #5 em2,
> 192.168.1.151#123 Enabled
> Jan  7 13:59:59 host1 ntpd[44765]: Listening on routing socket on fd #22
> for interface updates
> Jan  7 13:59:59 host1 ntpd[44765]: kernel time sync status 2040
> Jan  7 13:59:59 host1 ntpd[44765]: frequency initialized 499.399 PPM from
> /var/lib/ntp/drift
> Jan  7 14:00:01 host1 ntpd_initres[32103]: parent died before we finished,
> exiting
> Jan  7 14:04:17 host1 ntpd[44765]: synchronized to 192.168.0.191, stratum 2
> Jan  7 14:04:17 host1 ntpd[44765]: kernel time sync status change 2001
> Jan  7 14:26:02 host1 snmpd[4842]: Received TERM or STOP signal...
>  shutting down...
> Jan  7 14:26:02 host1 kernel: netlink: 12 bytes leftover after parsing
> attributes.
> Jan  7 14:26:02 host1 snmpd[45667]: NET-SNMP version 5.5
> Jan  7 14:52:48 host1 ntpd[44765]: no servers reachable
>
> It looks likely that NameNode received the SIGTERM signal sent by
> stopping ntpd command.
> Up to now, the problem has happened three times repeatedly, the time point
> was Jan  7 14:00:00, Jan 14 14:00:00 and Feb  4 14:00:00 respectively.
> Although the script to synchronize time is a little improper, and i also
> know the correct synchronized way. but i wonder why NameNode can receive
> the SIGTERM signal sent by stopping ntpd command? and why three times all
> happened at 14:00:00?
> Any ideas can be appreciated.
>

Re: Stopping ntpd signals SIGTERM, then causes namenode exit

Posted by daemeon reiydelle <da...@gmail.com>.

Absolutely a critical error to lose the configured ntpd time source in
Hadoop. The replication and many other services require absolutely
millisecond time sync between the nodes. Interesting that your SRE design
called for ntpd running on each node. Curious.

What is the problem you are trying to solve by stopping ntpd on the local
host? Did someone not understand how ntpd works? Did someone configure it
to (I sure hope not) be free running?




*.......*






*“Life should not be a journey to the grave with the intention of arriving
safely in apretty and well preserved body, but rather to skid in broadside
in a cloud of smoke,thoroughly used up, totally worn out, and loudly
proclaiming “Wow! What a Ride!” - Hunter ThompsonDaemeon C.M. ReiydelleUSA
(+1) 415.501.0198London (+44) (0) 20 8144 9872*

On Sun, Feb 8, 2015 at 7:30 PM, David chen <c7...@163.com> wrote:

> A shell script is deployed on every node of HDFS cluster, the script is
> invoked hourly by crontab, and its content is as follows:
> #!/bin/bash
> service ntpd stop
> ntpdate 192.168.0.1 #it's a valid ntpd server in LAN
> service ntpd start
> chkconfig ntpd on
>
> After several days, NameNode crashed suddenly, but its log seemed no other
> errors except the following:
> 2015-01-07 14:00:00,709 ERROR
> org.apache.hadoop.hdfs.server.namenode.NameNode: RECEIVED SIGNAL 15: SIGTERM
>
> Inspected the Linux log(Centos /var/log/messages), also found the
> following clues:
> Jan  7 14:00:01 host1 ntpd[32101]: ntpd exiting on signal 15
> Jan  7 13:59:59 host1 ntpd[44764]: ntpd 4.2.4p8@1.1612-o Fri Feb 22
> 11:23:27 UTC 2013 (1)
> Jan  7 13:59:59 host1 ntpd[44765]: precision = 0.143 usec
> Jan  7 13:59:59 host1 ntpd[44765]: Listening on interface #0 wildcard,
> 0.0.0.0#123 Disabled
> Jan  7 13:59:59 host1 ntpd[44765]: Listening on interface #1 wildcard,
> ::#123 Disabled
> Jan  7 13:59:59 host1 ntpd[44765]: Listening on interface #2 lo, ::1#123
> Enabled
> Jan  7 13:59:59 host1 ntpd[44765]: Listening on interface #3 em2,
> fe80::ca1f:66ff:fee1:eed#123 Enabled
> Jan  7 13:59:59 host1 ntpd[44765]: Listening on interface #4 lo,
> 127.0.0.1#123 Enabled
> Jan  7 13:59:59 host1 ntpd[44765]: Listening on interface #5 em2,
> 192.168.1.151#123 Enabled
> Jan  7 13:59:59 host1 ntpd[44765]: Listening on routing socket on fd #22
> for interface updates
> Jan  7 13:59:59 host1 ntpd[44765]: kernel time sync status 2040
> Jan  7 13:59:59 host1 ntpd[44765]: frequency initialized 499.399 PPM from
> /var/lib/ntp/drift
> Jan  7 14:00:01 host1 ntpd_initres[32103]: parent died before we finished,
> exiting
> Jan  7 14:04:17 host1 ntpd[44765]: synchronized to 192.168.0.191, stratum 2
> Jan  7 14:04:17 host1 ntpd[44765]: kernel time sync status change 2001
> Jan  7 14:26:02 host1 snmpd[4842]: Received TERM or STOP signal...
>  shutting down...
> Jan  7 14:26:02 host1 kernel: netlink: 12 bytes leftover after parsing
> attributes.
> Jan  7 14:26:02 host1 snmpd[45667]: NET-SNMP version 5.5
> Jan  7 14:52:48 host1 ntpd[44765]: no servers reachable
>
> It looks likely that NameNode received the SIGTERM signal sent by
> stopping ntpd command.
> Up to now, the problem has happened three times repeatedly, the time point
> was Jan  7 14:00:00, Jan 14 14:00:00 and Feb  4 14:00:00 respectively.
> Although the script to synchronize time is a little improper, and i also
> know the correct synchronized way. but i wonder why NameNode can receive
> the SIGTERM signal sent by stopping ntpd command? and why three times all
> happened at 14:00:00?
> Any ideas can be appreciated.
>