Posted to users@cloudstack.apache.org by Melanie Desaive <m....@heinlein-support.de> on 2018/06/20 13:53:07 UTC

com.cloud.agent.api.CheckRouterCommand timeout

Hi all,

We have a recurring problem with our virtual routers. According to the
log messages, com.cloud.agent.api.CheckRouterCommand runs into a timeout
and the router state therefore switches to UNKNOWN.

All network traffic through the routers is still working. They can be
accessed by their link-local IP addresses, and the configuration looks
good at first sight. But configuration changes made through the
CloudStack API no longer reach the routers. A reboot fixes the problem.

I would like to investigate a little further but lack an understanding
of how the checkRouter command tries to access the virtual router.

Could someone point me to some relevant documentation or give a short
overview of how the connection from CS-Management is made and where such
a timeout could occur?

As background information, the sequence in the management log looks
roughly like this:

---

 x Every few seconds the com.cloud.agent.api.CheckRouterCommand returns
a state of BACKUP or MASTER correctly
 x When the problem occurs, the log messages change. Some snippets follow:

 x ... Waiting some more time because this is the current command
 x ... Waiting some more time because this is the current command
 x Could not find exception:
com.cloud.exception.OperationTimedoutException in error code list for
exceptions
 x Timed out on Seq 28-2352567855348137104
 x Seq 28-2352567855348137104: Cancelling.
 x Operation timed out: Commands 2352567855348137104 to Host 28 timed
out after 60
 x Unable to update router r-2594-VM's status
 x Redundant virtual router (name: r-2594-VM, id: 2594)  just switch
from MASTER to UNKNOWN

 x Those error messages are then repeated for each following
CheckRouterCommand until the virtual router is rebooted
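
To see how often and for which routers this happens, the snippets above can be pulled out of the management log mechanically. The following is a minimal sketch, not part of the original post: the two patterns are modelled only on the messages quoted above, and the function name `summarize` is made up for illustration. Real log lines carry additional prefixes (timestamp, thread, class name), which the patterns simply skip over.

```python
import re

# Patterns modelled on the log snippets quoted in the post (assumption: the
# real management log contains these messages on a single line each).
TIMEOUT_RE = re.compile(r"Timed out on Seq (\d+)-(\d+)")
SWITCH_RE = re.compile(
    r"Redundant virtual router \(name: ([^,\s]+), id: (\d+)\)\s+"
    r"just switch\s+from (\w+) to (\w+)"
)


def summarize(lines):
    """Collect timed-out sequences and router state changes from log lines."""
    timeouts = []     # (host_id, sequence_number)
    transitions = []  # (router_name, old_state, new_state)
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            timeouts.append((int(match.group(1)), int(match.group(2))))
        match = SWITCH_RE.search(line)
        if match:
            transitions.append((match.group(1), match.group(3), match.group(4)))
    return timeouts, transitions
```

Calling `summarize(open(path))` on a log file (the path is whatever your installation uses) yields two lists that can be counted or grouped by host and router.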


Greetings,

Melanie

-- 
--

Heinlein Support GmbH
Linux: Akademie - Support - Hosting

http://www.heinlein-support.de
Tel: 030 / 40 50 51 - 0
Fax: 030 / 40 50 51 - 19

Mandatory disclosures per §35a GmbHG:
HRB 93818 B / Amtsgericht Berlin-Charlottenburg,
Managing director: Peer Heinlein  -- Registered office: Berlin

Re: com.cloud.agent.api.CheckRouterCommand timeout

Posted by Melanie Desaive <m....@heinlein-support.de>.

On 21.06.2018 at 17:08, Daan Hoogland wrote:
> makes sense, well let's hope all breaks soon ;)

I am sure it will break! :D

And then I will get back to you with more questions!

Thanks a lot for taking the time!


Re: com.cloud.agent.api.CheckRouterCommand timeout

Posted by Daan Hoogland <da...@gmail.com>.
makes sense, well let's hope all breaks soon ;)

-- 
Daan

Re: com.cloud.agent.api.CheckRouterCommand timeout

Posted by Melanie Desaive <m....@heinlein-support.de>.
Hi Daan,

On 21.06.2018 at 15:29, Daan Hoogland wrote:
> Melanie, attachments get deleted for this list. Your assumption for the
> comm path is right for xen. Did you try and execute the script as it is
> called by the proxy script from the host? and capture the return? We had a
> bad problem with getting the template version in the past on xen, this
> might be similar. That was due to processing of the returned string in the
> script.

I called both stages of the script manually, but at a time when
everything was working as expected and the routers were back to MASTER
and BACKUP.

The output looked like this:

[root@acs-compute-5 ~]# /opt/cloud/bin/router_proxy.sh checkrouter.sh 169.254.1.178
Status: BACKUP

root@r-2595-VM:~# /opt/cloud/bin/checkrouter.sh
Status: BACKUP
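
When the problem returns, the same manual check can be repeated with a hard time limit, so that "no answer at all" and "an answer that is not a Status line" are told apart. This is only a sketch under assumptions: the helper names `parse_status` and `run_check` are invented here, the 60-second default merely mirrors the "timed out after 60" message in the log (the unit is assumed to be seconds), and a Python 3 interpreter may not be available on a XenServer host.

```python
import re
import subprocess
import sys

# Matches the "Status: BACKUP" style line shown in the manual runs above.
STATUS_RE = re.compile(r"^Status:\s*(\w+)\s*$", re.MULTILINE)


def parse_status(output):
    """Return the state from a 'Status: <STATE>' line, or None if absent."""
    match = STATUS_RE.search(output)
    return match.group(1) if match else None


def run_check(cmd, timeout_s=60):
    """Run cmd with a time limit; return (state, detail), state is None on failure."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return None, f"no answer within {timeout_s}s"
    state = parse_status(proc.stdout)
    if state is None:
        return None, f"unparseable output (exit {proc.returncode}): {proc.stdout!r}"
    return state, "ok"


if __name__ == "__main__" and len(sys.argv) > 1:
    # The command to run is taken from the command line, for example the
    # router_proxy.sh call shown above.
    print(run_check(sys.argv[1:]))
```

Run on the hypervisor host with the router_proxy.sh command line from above as arguments, it prints either the reported state or the reason why none could be obtained.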



Re: com.cloud.agent.api.CheckRouterCommand timeout

Posted by Daan Hoogland <da...@gmail.com>.
Melanie, attachments get deleted for this list. Your assumption about the
communication path is right for Xen. Did you try to execute the script the
way the proxy script calls it from the host, and capture what it returns?
We had a bad problem with getting the template version on Xen in the past;
this might be similar. That was due to the processing of the returned
string in the script.

-- 
Daan

Re: com.cloud.agent.api.CheckRouterCommand timeout

Posted by Melanie Desaive <m....@heinlein-support.de>.
Hi Daan,

Thanks for your reply.

The latest occurrence of our VRs going to UNKNOWN resolved itself 24
hours after it had occurred. Nevertheless, I would appreciate some
insight into how the checkRouter command is handled, as I expect the
problem to come back.

On 21.06.2018 at 10:39, Daan Hoogland wrote:
> Melanie, this depends a bit on the type of hypervisor. The command executes
> the checkrouter.sh script on the virtual router if it reaches it, but it
> seems your problem is before that. I would look at the network first and
> follow the path that the execution takes for your hypervisortype.

With Stephan's help I worked out the following guess for the path of
connections for the checkrouter command. Could someone please correct
me if my guess is wrong? ;)

 x The management node connects to the XenServer hypervisor host via the
management network on port 22 by SSH
 x On the hypervisor host, the wrapper script
"/opt/cloud/bin/router_proxy.sh" is used to call scripts on system VMs
via link-local IP and port 3922
 x On the VR, the script "/opt/cloud/bin/checkrouter.sh" does the actual
check.

In our case, the API call times out with these log messages:
 x Operation timed out: Commands 1063975411966525473 to Host 29 timed
out after 60
 x Unable to update router r-2595-VM's status
 x Redundant virtual router (name: r-2595-VM, id: 2595)  just switch
from BACKUP to UNKNOWN

To me it seems that this is a timeout that occurs while ACS management is
waiting for the API call to return. At which stage, (management host <->
virtualization host) or (virtualization host <-> VR), the answer is
delayed is unclear to me. (SSH login from the virtualization host to the
VR via link-local works all the time.)
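
One way to narrow down the delayed stage is to time a plain TCP connect for each hop from the machine that originates it. The sketch below is only an illustration: the function name `time_tcp_connect` is invented, and the ports to probe (22 towards the hypervisor, 3922 towards the VR's link-local address) are taken from the guessed path above. A fast, successful connect only rules out basic reachability problems; it says nothing about what happens after the connection is established.

```python
import socket
import time


def time_tcp_connect(host, port, timeout_s=5.0):
    """Return (seconds_elapsed, error) for one TCP connect attempt.

    error is None on success, otherwise a short description of the failure.
    """
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            error = None
    except OSError as exc:  # covers timeouts and refused connections
        error = str(exc) or type(exc).__name__
    return time.monotonic() - start, error
```

For example, `time_tcp_connect("169.254.1.178", 3922)` would be run on the hypervisor host, and the same call with the hypervisor's management address and port 22 on the management server.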

It is also unclear to me why both VRs of the respective network stay in
UNKNOWN for 24 hours, remain accessible via link-local, but come back
immediately after a reboot.

I am grateful for any suggestions or explanations on this topic and will
investigate further as soon as the problem comes back.

A portion of our management log for the latest occurrence of the problem
is attached to this email.

Greetings,

Melanie


Re: com.cloud.agent.api.CheckRouterCommand timeout

Posted by Daan Hoogland <da...@gmail.com>.
Melanie, this depends a bit on the type of hypervisor. The command executes
the checkrouter.sh script on the virtual router if it reaches it, but it
seems your problem occurs before that. I would look at the network first
and follow the path that the execution takes for your hypervisor type.

-- 
Daan