Posted to dev@cloudstack.apache.org by Edison Su <Ed...@citrix.com> on 2013/07/15 23:42:32 UTC

How to fix libvirt storage pool refresh issue?

There is a serious issue on KVM (https://issues.apache.org/jira/browse/CLOUDSTACK-2729): a libvirt storage pool can disappear on a KVM host, and it is easy to reproduce in our internal QA environment.
Wei found the root cause, which is in libvirt:
"
This is a libvirt issue. I created a ticket for it.
https://bugzilla.redhat.com/show_bug.cgi?id=977706
The patch is very simple.
https://www.redhat.com/archives/libvir-list/2013-July/msg00635.html
"
But it is also triggered by CloudStack, because CloudStack calls the libvirt storage pool refresh method every time it accesses the storage pool. The code was added by commit 2ffc9907f7b0d371737e39b7649f7af23026f5cf, less than a year ago.

As Wei suggested, we can call storage pool refresh only when needed. That will mitigate the issue (it is the behavior I implemented in CloudStack pre-4.0), but it only treats the symptom, not the cause.
Or we can add a cluster-wide lock so that only one host can access the storage pool at a time; for example, a file lock on NFS primary storage.
Any ideas/feedback on how to fix this KVM issue?
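
For illustration, the file-lock option could look roughly like the sketch below. This is not the actual patch: PoolFileLock and withLock are hypothetical names, and the sketch assumes the lock file sits on the shared NFS primary storage and that the NFS setup honors file locks across hosts, which depends on the NFS version and lock daemon configuration.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.function.Supplier;

/** Hypothetical helper, not CloudStack code: runs an action behind an exclusive file lock. */
final class PoolFileLock {
    private PoolFileLock() {}

    /**
     * Blocks until an exclusive lock on lockFile is granted, runs the action,
     * then releases the lock and closes the channel.
     * Note: the lock is held per JVM; a second lock attempt on the same file from
     * another thread of the same JVM throws OverlappingFileLockException instead of waiting.
     */
    static <T> T withLock(Path lockFile, Supplier<T> action) throws IOException {
        try (FileChannel channel = FileChannel.open(lockFile,
                     StandardOpenOption.CREATE, StandardOpenOption.WRITE);
             FileLock lock = channel.lock()) {
            // Only one process holding this lock gets here at a time,
            // e.g. to refresh or otherwise touch the storage pool.
            return action.get();
        }
    }
}
```

Every agent would wrap its storage pool access in withLock using the same lock file on the shared mount. Since each caller waits for the lock, this trades some VM deployment speed for safety.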




Re: How to fix libvirt storage pool refresh issue?

Posted by Wei ZHOU <us...@gmail.com>.
Edison,

Please review the patch: https://reviews.apache.org/r/13223/

-Wei



Re: How to fix libvirt storage pool refresh issue?

Posted by Wei ZHOU <us...@gmail.com>.
Alex,

Exactly.

We can also use Enable Maintenance -> umount the NFS mount point -> restart
cloudstack-agent -> Cancel Maintenance to recover from this issue.

-Wei



RE: How to fix libvirt storage pool refresh issue?

Posted by Alex Huang <Al...@citrix.com>.
I have very limited knowledge of KVM, but from what I understand from Edison, we should consider what has to be done to fix this problem once it occurs:

- Shut down all VMs on all affected hosts.
- umount the NFS mount point.
- Re-establish the storage pool.
- Restart the VMs.

Given how severe these actions are to the end user, I would vote for the file lock to ensure it never happens, even if it's slower.

--Alex


Re: How to fix libvirt storage pool refresh issue?

Posted by Wido den Hollander <wi...@widodh.nl>.

On 08/02/2013 01:22 PM, Wei ZHOU wrote:
> Wido,
>
> You applied the libvirt patch on your production system, and this issue
> disappeared, right?

Yes, I applied that patch and rebuilt libvirt. The issue never came back.

> If so, that is good.
> I hope the Red Hat community can accept the patch (or the v2:
> https://www.redhat.com/archives/libvir-list/2013-July/msg00639.html) ASAP.
>

I've been watching the thread, but it seems kind of dead.

Wido

>
> -Wei
>
>

Re: How to fix libvirt storage pool refresh issue?

Posted by Wei ZHOU <us...@gmail.com>.
Wido,

You applied the libvirt patch on your production system, and this issue
disappeared, right?
If so, that is good.
I hope the Red Hat community can accept the patch (or the v2:
https://www.redhat.com/archives/libvir-list/2013-July/msg00639.html) ASAP.


-Wei



Re: How to fix libvirt storage pool refresh issue?

Posted by Wido den Hollander <wi...@widodh.nl>.

On 08/02/2013 01:55 AM, Edison Su wrote:
> Hi Wei, regarding to the bug CLOUDSTACK-2729, I removed storage.refresh during getStoragePool in LibvirtStorageAdaptor, but the issue still happened in BVT.
> I am thinking add file lock on primary storage, seems you already have the patch, could you share the patch with us?
>

FYI, I fixed this by patching libvirt on our production systems
rather than fixing the CloudStack agent.

It's just one very small patch: 
https://bugzilla.redhat.com/show_bug.cgi?id=977706

https://www.redhat.com/archives/libvir-list/2013-July/msg00635.html

Wido


RE: How to fix libvirt storage pool refresh issue?

Posted by Edison Su <Ed...@citrix.com>.
Hi Wei, regarding bug CLOUDSTACK-2729: I removed storage.refresh from getStoragePool in LibvirtStorageAdaptor, but the issue still occurred in BVT.
I am thinking of adding a file lock on primary storage. It seems you already have a patch for that; could you share it with us?


Re: How to fix libvirt storage pool refresh issue?

Posted by Wido den Hollander <wi...@widodh.nl>.
On 07/16/2013 12:34 PM, Wei ZHOU wrote:
> I agree with Wido.
>

So how do you propose we refresh the pool?

Right now we call storagePoolRefresh() every time we do a
getStoragePool() on the LibvirtStorageAdaptor, which is called on a very
regular basis by all KVM agents for various tasks.

We don't actually need to refresh the pools every X seconds; once every
10 minutes or so would be fine.

Right now we seem to be sending GetStorageStatsCommand to the KVM agents
every minute, but internally the agent calls refresh even more often.

Can't we "cache" the storage pool refresh information for some time and
not call the refresh that often?

Refreshing also does a lot of unneeded I/O to your storage subsystem, so
I'd say we do that less frequently.

Suggestions?

Wido
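
A minimal sketch of the "cache the refresh" idea above, assuming a fixed minimum interval between refreshes. ThrottledRefresher and refreshIfStale are hypothetical names, not existing CloudStack code; the Runnable stands in for whatever wraps the libvirt storage pool refresh, and the clock is injected only so the logic can be exercised without waiting.

```java
import java.util.function.LongSupplier;

/** Hypothetical helper: runs a refresh action at most once per interval. */
final class ThrottledRefresher {
    private final long minIntervalMillis;
    private final LongSupplier clock;      // e.g. System::currentTimeMillis
    private final Runnable refreshAction;  // stand-in for the storage pool refresh
    private long lastRefreshMillis;
    private boolean refreshedOnce = false;

    ThrottledRefresher(long minIntervalMillis, LongSupplier clock, Runnable refreshAction) {
        this.minIntervalMillis = minIntervalMillis;
        this.clock = clock;
        this.refreshAction = refreshAction;
    }

    /** Runs the refresh only if none ran within the interval; returns true if it ran. */
    synchronized boolean refreshIfStale() {
        long now = clock.getAsLong();
        if (refreshedOnce && now - lastRefreshMillis < minIntervalMillis) {
            return false; // recent enough, reuse the cached pool information
        }
        refreshAction.run();
        lastRefreshMillis = now;
        refreshedOnce = true;
        return true;
    }

    /** Forces a refresh, e.g. right after a volume was created or deleted. */
    synchronized void refreshNow() {
        refreshAction.run();
        lastRefreshMillis = clock.getAsLong();
        refreshedOnce = true;
    }
}
```

With a 10-minute interval (600000 ms), getStoragePool() could call refreshIfStale() on every access and keep refreshNow() for the cases where fresh data is really needed. This only reduces how often the libvirt race can be hit; it does not remove it.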

> Moreover, the file lock will cause performance degradation of VM deployment.
>
> -Wei
>
>
> 2013/7/16 Wido den Hollander <wi...@widodh.nl>
>
>> On 07/16/2013 12:27 AM, Marcus Sorensen wrote:
>>
>>>      I'm ok with a symptom fix on our end, if the root cause is in
>>> Libvirt we can't do much about that. This is the sort of patch that
>>> tends to get pulled into the regular update cycle of the
>>> distributions, so unless there's more to it and it's not a good fix I
>>> imagine we will see it come through without having to wait for the
>>> next point releases. We still have to support existing users who might
>>> not be running the latest, though, so the symptom fix is probably ok
>>> as a temporary measure.
>>>
>>
>> I'm ok with not calling storagePoolRefresh every time we want a capacity
>> update, since that's also kind of I/O intensive for larger storage arrays.
>>
>> However, we should make sure we have a GOOD comment in the code about this
>> "fix", since that's the reason I initially removed the old code which
>> invoked "df".
>>
>> I'll see if I can get this libvirt patch into Ubuntu when it hits libvirt
>> upstream, since this bug is really annoying.
>>
>> Wido
>>
>>
>>
>>> On Mon, Jul 15, 2013 at 3:42 PM, Edison Su <Ed...@citrix.com> wrote:
>>>
>>>> There is a serious issue on KVM
>>>> (https://issues.apache.org/jira/browse/CLOUDSTACK-2729): a libvirt
>>>> storage pool can disappear on a KVM host, and it is easy to reproduce
>>>> in our internal QA environment.
>>>> Wei found the root cause, which is in libvirt:
>>>> "
>>>> This is a libvirt issue. I created a ticket for it.
>>>> https://bugzilla.redhat.com/show_bug.cgi?id=977706
>>>> The patch is very simple.
>>>> https://www.redhat.com/archives/libvir-list/2013-July/msg00635.html
>>>> "
>>>> But it is also triggered by CloudStack, because CloudStack calls the
>>>> libvirt storage pool refresh method every time it accesses the storage
>>>> pool. The code was added by commit
>>>> 2ffc9907f7b0d371737e39b7649f7af23026f5cf, less than a year ago.
>>>>
>>>> As Wei suggested, we can call storage pool refresh only when needed.
>>>> That will mitigate the issue (it is the behavior I implemented in
>>>> CloudStack pre-4.0), but it only treats the symptom, not the cause.
>>>> Or we can add a cluster-wide lock so that only one host can access the
>>>> storage pool at a time; for example, a file lock on NFS primary storage.
>>>> Any ideas/feedback on how to fix this KVM issue?
>>>>
>>>>
>>>>
>>>>
>>
>


Re: How to fix libvirt storage pool refresh issue?

Posted by Wei ZHOU <us...@gmail.com>.
I agree with Wido.

Moreover, the file lock will degrade the performance of VM deployment.

-Wei


2013/7/16 Wido den Hollander <wi...@widodh.nl>

> On 07/16/2013 12:27 AM, Marcus Sorensen wrote:
>
>>     I'm ok with a symptom fix on our end, if the root cause is in
>> Libvirt we can't do much about that. This is the sort of patch that
>> tends to get pulled into the regular update cycle of the
>> distributions, so unless there's more to it and it's not a good fix I
>> imagine we will see it come through without having to wait for the
>> next point releases. We still have to support existing users who might
>> not be running the latest, though, so the symptom fix is probably ok
>> as a temporary measure.
>>
>
> I'm ok with not calling storagePoolRefresh every time we want a capacity
> update, since that's also kind of I/O intensive for larger storage arrays.
>
> However, we should make sure we have a GOOD comment in the code about this
> "fix", since that's the reason I initially removed the old code which
> invoked "df".
>
> I'll see if I can get this libvirt patch into Ubuntu when it hits libvirt
> upstream, since this bug is really annoying.
>
> Wido
>
>
>
>> On Mon, Jul 15, 2013 at 3:42 PM, Edison Su <Ed...@citrix.com> wrote:
>>
>>> There is a serious issue on KVM(https://issues.apache.org/jira/browse/CLOUDSTACK-2729):
>>> a libvirt storage pool can disappear on KVM host, it's easy to be
>>> reproduced in our internal QA environment.
>>> Wei found the root cause, is on the libvirt:
>>> "
>>> This is a libvirt issue. I created a ticket for it.
>>> https://bugzilla.redhat.com/show_bug.cgi?id=977706
>>> The patch is very simple.
>>> https://www.redhat.com/archives/libvir-list/2013-July/msg00635.html
>>> "
>>> But it's also introduced by CloudStack, as cloudstack will call libvirt
>>> storage pool refresh method each time when access the storage pool. The
>>> code is added by commit: 2ffc9907f7b0d371737e39b7649f7af23026f5cf,
>>> about less than one year ago.
>>>
>>> As Wei suggested, we can call storage pool refresh only if needed, it
>>> will mitigate the issue(It's behavior I did on cloudstack pre-4.0), but
>>> it's only treat the symptom, not the cause.
>>> Or add a cluster wide lock, only one guy can access storage pool at one
>>> time, we can add a file lock on NFS primary storage.
>>> Any idea/feedback on how to fix this KVM issue?
>>>
>>>
>>>
>>>
>

Re: How to fix libvirt storage pool refresh issue?

Posted by Wido den Hollander <wi...@widodh.nl>.
On 07/16/2013 12:27 AM, Marcus Sorensen wrote:
>     I'm ok with a symptom fix on our end, if the root cause is in
> Libvirt we can't do much about that. This is the sort of patch that
> tends to get pulled into the regular update cycle of the
> distributions, so unless there's more to it and it's not a good fix I
> imagine we will see it come through without having to wait for the
> next point releases. We still have to support existing users who might
> not be running the latest, though, so the symptom fix is probably ok
> as a temporary measure.

I'm ok with not calling storagePoolRefresh every time we want a capacity 
update, since that's also kind of I/O intensive for larger storage arrays.

However, we should make sure we have a GOOD comment in the code about 
this "fix", since that's the reason I initially removed the old code 
which invoked "df".
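
For illustration only, a capacity check in the spirit of the old "df" approach could look like the sketch below. It reads the numbers from the mounted filesystem and never touches the libvirt pool. It assumes the pool is a filesystem mounted at a local path (as with NFS primary storage); the function name is hypothetical and this is not the code that was removed.

```python
import shutil


def mount_point_capacity(path):
    """Return capacity figures (in bytes) for the filesystem holding `path`.

    This only asks the operating system for filesystem statistics, much
    like "df" does, so it triggers no storage pool refresh.
    """
    usage = shutil.disk_usage(path)
    return {
        "capacity": usage.total,
        "used": usage.used,
        "available": usage.free,
    }


if __name__ == "__main__":
    # "." stands in for the pool's mount point, which is an assumption here.
    print(mount_point_capacity("."))
```

The trade-off is that these figures describe the filesystem as a whole, not the per-volume allocation that libvirt reports after a pool refresh.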

I'll see if I can get this libvirt patch into Ubuntu when it hits 
libvirt upstream, since this bug is really annoying.

Wido

>
> On Mon, Jul 15, 2013 at 3:42 PM, Edison Su <Ed...@citrix.com> wrote:
>> There is a serious issue on KVM(https://issues.apache.org/jira/browse/CLOUDSTACK-2729): a libvirt storage pool can disappear on KVM host, it's easy to be reproduced in our internal QA environment.
>> Wei found the root cause, is on the libvirt:
>> "
>> This is a libvirt issue. I created a ticket for it.
>> https://bugzilla.redhat.com/show_bug.cgi?id=977706
>> The patch is very simple.
>> https://www.redhat.com/archives/libvir-list/2013-July/msg00635.html
>> "
>> But it's also introduced by CloudStack, as cloudstack will call libvirt storage pool refresh method each time when access the storage pool. The code is added by commit: 2ffc9907f7b0d371737e39b7649f7af23026f5cf, about less than one year ago.
>>
>> As Wei suggested, we can call storage pool refresh only if needed, it will mitigate the issue(It's behavior I did on cloudstack pre-4.0), but it's only treat the symptom, not the cause.
>> Or add a cluster wide lock, only one guy can access storage pool at one time, we can add a file lock on NFS primary storage.
>> Any idea/feedback on how to fix this KVM issue?
>>
>>
>>


Re: How to fix libvirt storage pool refresh issue?

Posted by Marcus Sorensen <sh...@gmail.com>.
I'm ok with a symptom fix on our end; if the root cause is in
libvirt, we can't do much about that. This is the sort of patch that
tends to get pulled into the regular update cycle of the
distributions, so unless there's more to it and it's not a good fix I
imagine we will see it come through without having to wait for the
next point releases. We still have to support existing users who might
not be running the latest, though, so the symptom fix is probably ok
as a temporary measure.

On Mon, Jul 15, 2013 at 3:42 PM, Edison Su <Ed...@citrix.com> wrote:
> There is a serious issue on KVM(https://issues.apache.org/jira/browse/CLOUDSTACK-2729): a libvirt storage pool can disappear on KVM host, it's easy to be reproduced in our internal QA environment.
> Wei found the root cause, is on the libvirt:
> "
> This is a libvirt issue. I created a ticket for it.
> https://bugzilla.redhat.com/show_bug.cgi?id=977706
> The patch is very simple.
> https://www.redhat.com/archives/libvir-list/2013-July/msg00635.html
> "
> But it's also introduced by CloudStack, as cloudstack will call libvirt storage pool refresh method each time when access the storage pool. The code is added by commit: 2ffc9907f7b0d371737e39b7649f7af23026f5cf, about less than one year ago.
>
> As Wei suggested, we can call storage pool refresh only if needed, it will mitigate the issue(It's behavior I did on cloudstack pre-4.0), but it's only treat the symptom, not the cause.
> Or add a cluster wide lock, only one guy can access storage pool at one time, we can add a file lock on NFS primary storage.
> Any idea/feedback on how to fix this KVM issue?
>
>
>