Posted to dev@cloudstack.apache.org by Hugo Trippaers <ht...@schubergphilis.com> on 2013/03/05 09:47:49 UTC

Re: Review Request: take into account potential NFS timeouts when determining if xenheartbeat timeout value has been met.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/9647/#review17390
-----------------------------------------------------------

Ship It!

- Hugo Trippaers


On Feb. 27, 2013, 9:06 a.m., Brenn Oosterbaan wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/9647/
> -----------------------------------------------------------
> 
> (Updated Feb. 27, 2013, 9:06 a.m.)
> 
> 
> Review request for cloudstack and Hugo Trippaers.
> 
> 
> Description
> -------
> 
> In some storage failure scenarios, the NFS timeout can cause writing the heartbeat to take longer than expected. By comparing the epoch of the last successful heartbeat with the current epoch, we check whether the timeout value has been met.
> 
> 
> Diffs
> -----
> 
>   scripts/vm/hypervisor/xenserver/xenheartbeat.sh 5edacf7 
> 
> Diff: https://reviews.apache.org/r/9647/diff/
> 
> 
> Testing
> -------
> 
> Tested on hostxxx with an empty heartbeat file:
> Feb 26 21:54:13 hostxxx heartbeat: Problem with heartbeat, no iSCSI or NFS mount defined in /opt/xensource/bin/heartbeat!
> 
> Tested on hostxxx with a 120-second timeout value by causing a storage failover (hits NFS timeout):
> Feb 26 08:04:15 hostxxx heartbeat: Potential problem with /var/run/sr-mount/d392d770-330b-bdbf-9c07-e1c38af81c6e/hb-faecefb3-9ac0-47a2-b0fb-ae383762ba13: not reachable since 18 seconds
> Feb 26 08:04:48 hostxxx heartbeat: Potential problem with /var/run/sr-mount/d392d770-330b-bdbf-9c07-e1c38af81c6e/hb-faecefb3-9ac0-47a2-b0fb-ae383762ba13: not reachable since 51 seconds
> Feb 26 08:05:20 hostxxx heartbeat: Potential problem with /var/run/sr-mount/d392d770-330b-bdbf-9c07-e1c38af81c6e/hb-faecefb3-9ac0-47a2-b0fb-ae383762ba13: not reachable since 83 seconds
> The storage failover stayed within the 120-second timeout value, so no reboot occurred.
> 
> Tested on hostxxx with a 120-second timeout by removing the storage altogether (hits NFS timeout):
> Feb 26 10:08:52 hostxxx heartbeat: Potential problem with /var/run/sr-mount/test/hb-test: not reachable since 32 seconds
> Feb 26 10:09:24 hostxxx heartbeat: Potential problem with /var/run/sr-mount/test/hb-test: not reachable since 64 seconds
> Feb 26 10:09:57 hostxxx heartbeat: Potential problem with /var/run/sr-mount/test/hb-test: not reachable since 97 seconds
> Feb 26 10:10:29 hostxxx heartbeat: Potential problem with /var/run/sr-mount/test/hb-test: not reachable since 129 seconds
> Feb 26 10:10:29 hostxxx heartbeat: Problem with /var/run/sr-mount/test/hb-test: not reachable since 129 seconds, rebooting system!
> 
> Tested on hostxxx with a 120-second timeout by removing write permissions on the storage (does not hit NFS timeout):
> Feb 26 10:22:13 hostxxx heartbeat: Potential problem with /var/run/sr-mount/test/hb-test: not reachable since 5 seconds
> Feb 26 10:22:18 hostxxx heartbeat: Potential problem with /var/run/sr-mount/test/hb-test: not reachable since 10 seconds
> Feb 26 10:22:23 hostxxx heartbeat: Potential problem with /var/run/sr-mount/test/hb-test: not reachable since 15 seconds
> Feb 26 10:22:28 hostxxx heartbeat: Potential problem with /var/run/sr-mount/test/hb-test: not reachable since 20 seconds
> Feb 26 10:22:33 hostxxx heartbeat: Potential problem with /var/run/sr-mount/test/hb-test: not reachable since 25 seconds
> Feb 26 10:22:38 hostxxx heartbeat: Potential problem with /var/run/sr-mount/test/hb-test: not reachable since 30 seconds
> Feb 26 10:22:43 hostxxx heartbeat: Potential problem with /var/run/sr-mount/test/hb-test: not reachable since 35 seconds
> Feb 26 10:22:48 hostxxx heartbeat: Potential problem with /var/run/sr-mount/test/hb-test: not reachable since 40 seconds
> Feb 26 10:22:53 hostxxx heartbeat: Potential problem with /var/run/sr-mount/test/hb-test: not reachable since 45 seconds
> Feb 26 10:22:58 hostxxx heartbeat: Potential problem with /var/run/sr-mount/test/hb-test: not reachable since 50 seconds
> Feb 26 10:23:03 hostxxx heartbeat: Potential problem with /var/run/sr-mount/test/hb-test: not reachable since 55 seconds
> Feb 26 10:23:08 hostxxx heartbeat: Potential problem with /var/run/sr-mount/test/hb-test: not reachable since 60 seconds
> Feb 26 10:23:13 hostxxx heartbeat: Potential problem with /var/run/sr-mount/test/hb-test: not reachable since 65 seconds
> Feb 26 10:23:18 hostxxx heartbeat: Potential problem with /var/run/sr-mount/test/hb-test: not reachable since 70 seconds
> Feb 26 10:23:23 hostxxx heartbeat: Potential problem with /var/run/sr-mount/test/hb-test: not reachable since 75 seconds
> Feb 26 10:23:28 hostxxx heartbeat: Potential problem with /var/run/sr-mount/test/hb-test: not reachable since 80 seconds
> Feb 26 10:23:33 hostxxx heartbeat: Potential problem with /var/run/sr-mount/test/hb-test: not reachable since 85 seconds
> Feb 26 10:23:38 hostxxx heartbeat: Potential problem with /var/run/sr-mount/test/hb-test: not reachable since 90 seconds
> Feb 26 10:23:43 hostxxx heartbeat: Potential problem with /var/run/sr-mount/test/hb-test: not reachable since 95 seconds
> Feb 26 10:23:48 hostxxx heartbeat: Potential problem with /var/run/sr-mount/test/hb-test: not reachable since 100 seconds
> Feb 26 10:23:53 hostxxx heartbeat: Potential problem with /var/run/sr-mount/test/hb-test: not reachable since 105 seconds
> Feb 26 10:23:58 hostxxx heartbeat: Potential problem with /var/run/sr-mount/test/hb-test: not reachable since 110 seconds
> Feb 26 10:24:03 hostxxx heartbeat: Potential problem with /var/run/sr-mount/test/hb-test: not reachable since 115 seconds
> Feb 26 10:24:08 hostxxx heartbeat: Potential problem with /var/run/sr-mount/test/hb-test: not reachable since 120 seconds
> Feb 26 10:24:08 hostxxx heartbeat: Problem with /var/run/sr-mount/test/hb-test: not reachable for 120 seconds, rebooting system!
> 
> 
> Thanks,
> 
> Brenn Oosterbaan
> 
>
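
For readers without access to the diff, the idea in the quoted description (compare the epoch of the last successful heartbeat with the current epoch, instead of counting failed attempts) can be sketched roughly as below. This is a minimal illustration and not the code from xenheartbeat.sh: the function names heartbeat_state and write_heartbeat, the "warn"/"fence" labels, and the choice of a greater-or-equal comparison are all assumptions made for the example.

```shell
#!/bin/bash
# Illustrative sketch only: names and labels are hypothetical and are not
# taken from xenheartbeat.sh.

# Classify the heartbeat state from wall-clock time rather than from a
# count of failed attempts, so that time spent blocked in an NFS timeout
# is included in the elapsed time.
#   $1 = epoch of the last successful heartbeat write
#   $2 = current epoch
#   $3 = timeout in seconds
heartbeat_state() {
    local last_ok=$1 now=$2 timeout=$3
    local elapsed=$(( now - last_ok ))
    if [ "$elapsed" -ge "$timeout" ]; then
        echo "fence $elapsed"   # timeout met: the caller would reboot
    else
        echo "warn $elapsed"    # potential problem, still within timeout
    fi
}

# Try to write the current epoch to the heartbeat file given as $1.
# Returns non-zero when the write fails. On a hung NFS mount the write
# itself may block for a long time before it fails.
write_heartbeat() {
    { date +%s > "$1"; } 2>/dev/null
}

# Elapsed times taken from the test logs above, with a 120-second timeout:
heartbeat_state 1000 1083 120   # prints "warn 83"
heartbeat_state 1000 1129 120   # prints "fence 129"
```

A monitoring loop would call write_heartbeat on each pass, record the current epoch whenever it succeeds, and call heartbeat_state with that recorded epoch whenever it fails. Because the elapsed time comes from the clock, a single write that blocks for about 32 seconds counts as 32 seconds, which matches the spacing of the log lines in the NFS timeout tests.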


Re: Review Request: take into account potential NFS timeouts when determining if xenheartbeat timeout value has been met.

Posted by Hugo Trippaers <ht...@schubergphilis.com>.

> On March 5, 2013, 8:47 a.m., Hugo Trippaers wrote:
> > Ship It!

Commit on master: e8b6f6658280f858e6c15a8b4e5ac4b74eff4490


- Hugo

