Posted to dev@cloudstack.apache.org by Indra Pramana <in...@sg.or.id> on 2013/10/08 04:59:05 UTC

High CPU utilization on KVM hosts while doing RBD snapshot - was Re: snapshot caused host disconnected

Dear Wido and all,

I performed some further tests last night:

(1) CPU utilization on the KVM host still shoots up very high while an RBD
snapshot is running, even after I set the global setting
concurrent.snapshots.threshold.perhost to 2.

(2) Most of the concurrent snapshot jobs fail: they either get stuck in the
"Creating" state or end with a "CreatedOnPrimary" error message.

(3) I have also adjusted other related global settings, such as
backup.snapshot.wait and job.expire.minutes, without any luck.

Any advice on what causes the high CPU utilization would be greatly
appreciated.

Looking forward to your reply, thank you.

Cheers.


On Mon, Oct 7, 2013 at 11:03 PM, Indra Pramana <in...@sg.or.id> wrote:

> Dear all,
>
> I also found out that while an RBD snapshot is being run, the CPU
> utilisation on the KVM host shoots up very high, which might
> explain why the host becomes disconnected.
>
> top - 22:49:32 up 3 days, 19:31,  1 user,  load average: 7.85, 4.97, 3.47
> Tasks: 297 total,   3 running, 294 sleeping,   0 stopped,   0 zombie
> Cpu(s):  4.5%us,  1.2%sy,  0.0%ni, 94.1%id,  0.1%wa,  0.0%hi,  0.0%si,
> 0.0%st
> Mem:  264125244k total, 77203460k used, 186921784k free,   154888k buffers
> Swap:   545788k total,        0k used,   545788k free, 60677092k cached
>
>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
> 18161 root      20   0 3871m  31m 8444 S  101  0.0 301:58.09 kvm
>  2790 root      20   0 43.5g 1.6g  19m S   97  0.7  45:52.42 jsvc
> 24544 root      20   0 4583m  31m 8364 S   97  0.0 425:29.48 kvm
>  6537 root      20   0     0    0    0 R   71  0.0   0:17.49 kworker/3:2
> 22546 root      20   0 6143m 2.0g 8452 S   26  0.8  55:14.07 kvm
>  4219 root      20   0 7671m 4.0g 8524 S    6  1.6 106:12.26 kvm
>  5989 root      20   0 43.2g 1.6g  232 D    6  0.6   0:08.13 jsvc
>  5993 root      20   0 43.3g 1.6g  224 D    6  0.6   0:08.36 jsvc
>
> Is it normal for the host's CPU utilisation to be higher than usual while a
> snapshot runs on a VM on that host? How can I limit the CPU resources used
> by the snapshot?
>
>
> Looking forward to your reply, thank you.
>
> Cheers.
>
>
>
> On Mon, Oct 7, 2013 at 7:18 PM, Indra Pramana <in...@sg.or.id> wrote:
>
>> Dear all,
>>
>> I did some tests on snapshots, since they are now supported for my Ceph RBD
>> primary storage in CloudStack 4.2. When I ran a snapshot for a particular
>> VM instance earlier, I noticed that it caused the host (where the VM
>> runs) to become disconnected.
>>
>> Here's the excerpt from the agent.log:
>>
>> http://pastebin.com/dxVV7stu
>>
>> The management-server.log doesn't show much beyond
>> detecting that the host was down and activating HA:
>>
>> http://pastebin.com/UeLiSm9K
>>
>> Can anyone advise what is causing the problem? So far only one user is
>> doing snapshots and it has already caused issues on the host; I can't
>> imagine what would happen if multiple users snapshot at the same time.
>>
>> I read about snapshot job throttling, which is described in the manual:
>>
>>
>> http://cloudstack.apache.org/docs/en-US/Apache_CloudStack/4.2.0/html/Admin_Guide/working-with-snapshots.html
>>
>> But I am not sure whether this will help to resolve the problem, since
>> only one user is performing snapshots and we already encounter
>> the problem.
>>
>> Can anyone advise how I can troubleshoot further and find a solution to
>> the problem?
>>
>> Looking forward to your reply, thank you.
>>
>> Cheers.
>>
>
>

Re: High CPU utilization on KVM hosts while doing RBD snapshot - was Re: snapshot caused host disconnected

Posted by Marcus Sorensen <sh...@gmail.com>.
You may want to post this to the ceph mailing list as well.

On Mon, Oct 7, 2013 at 8:59 PM, Indra Pramana <in...@sg.or.id> wrote:
> Dear Wido and all,
>
> I performed some further tests last night:
>
> (1) CPU utilization of the KVM host while RBD snapshot running is still
> shooting up high even after I set global setting:
> concurrent.snapshots.threshold.perhost to 2.
>
> (2) Most of the concurrent snapshot processes will fail with either stuck
> in "Creating" state, or "CreatedOnPrimary" error message.
>
> (3) I also have adjusted some other related global settings such as
> backup.snapshot.wait and job.expire.minutes, without any luck.
>
> Any advise on the reason what causes the high CPU utilization is greatly
> appreciated.
>
> Looking forward to your reply, thank you.
>
> Cheers.
>
>
> On Mon, Oct 7, 2013 at 11:03 PM, Indra Pramana <in...@sg.or.id> wrote:
>
>> Dear all,
>>
>> I also found out that when the RBD snapshot is being run, the CPU
>> utilisation on the KVM host will be shooting up very high, which might
>> explain why the host becomes disconnected.
>>
>> top - 22:49:32 up 3 days, 19:31,  1 user,  load average: 7.85, 4.97, 3.47
>> Tasks: 297 total,   3 running, 294 sleeping,   0 stopped,   0 zombie
>> Cpu(s):  4.5%us,  1.2%sy,  0.0%ni, 94.1%id,  0.1%wa,  0.0%hi,  0.0%si,
>> 0.0%st
>> Mem:  264125244k total, 77203460k used, 186921784k free,   154888k buffers
>> Swap:   545788k total,        0k used,   545788k free, 60677092k cached
>>
>>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>> 18161 root      20   0 3871m  31m 8444 S  101  0.0 301:58.09 kvm
>>  2790 root      20   0 43.5g 1.6g  19m S   97  0.7  45:52.42 jsvc
>> 24544 root      20   0 4583m  31m 8364 S   97  0.0 425:29.48 kvm
>>  6537 root      20   0     0    0    0 R   71  0.0   0:17.49 kworker/3:2
>> 22546 root      20   0 6143m 2.0g 8452 S   26  0.8  55:14.07 kvm
>>  4219 root      20   0 7671m 4.0g 8524 S    6  1.6 106:12.26 kvm
>>  5989 root      20   0 43.2g 1.6g  232 D    6  0.6   0:08.13 jsvc
>>  5993 root      20   0 43.3g 1.6g  224 D    6  0.6   0:08.36 jsvc
>>
>> Is it normal when snapshot is being run on the VM running on that host,
>> the host's CPU utilisation will be higher than usual? How can I limit the
>> CPU resources used by the snapshot?
>>
>>
>> Looking forward to your reply, thank you.
>>
>> Cheers.
>>
>>
>>
>> On Mon, Oct 7, 2013 at 7:18 PM, Indra Pramana <in...@sg.or.id> wrote:
>>
>>> Dear all,
>>>
>>> I did some tests on snapshots since it's now supported for my Ceph RBD
>>> primary storage in CloudStack 4.2. When I ran the snapshot for a particular
>>> VM instance earlier, I noticed that this has caused the host (where the VM
>>> is on) becomes disconnected.
>>>
>>> Here's the excerpt from the agent.log:
>>>
>>> http://pastebin.com/dxVV7stu
>>>
>>> The management-server.log doesn't much showing anything other than
>>> detecting that the host was down and HA is being activated:
>>>
>>> http://pastebin.com/UeLiSm9K
>>>
>>> Anyone can advise what is causing the problem? So far there is only one
>>> user doing the snapshotting and it has caused issues to the host, I can't
>>> imagine what if multiple users try to do snapshotting at the same time?
>>>
>>> I read about snapshot job throttling which is described on the manual:
>>>
>>>
>>> http://cloudstack.apache.org/docs/en-US/Apache_CloudStack/4.2.0/html/Admin_Guide/working-with-snapshots.html
>>>
>>> But I am not too sure whether this will help to resolve the problem since
>>> there is only one user trying to perform snapshot and we already encounter
>>> the problem already.
>>>
>>> Anyone can advise how I can troubleshoot further and find a solution to
>>> the problem?
>>>
>>> Looking forward to your reply, thank you.
>>>
>>> Cheers.
>>>
>>
>>

RE: High CPU utilization on KVM hosts while doing RBD snapshot - was Re: snapshot caused host disconnected

Posted by Suresh Sadhu <Su...@citrix.com>.
Indra,

Are you seeing the high CPU utilization only through CloudStack, or is it the same with Ceph directly? Please try the commands below and share your results after performing a snapshot from Ceph. This will help us isolate whether the problem really happens at snapshot time or during the file conversion happening in the background.

--> Create a volume, write some data, and perform a snapshot (either command form works):

##
rbd --pool rbd snap create --snap snapname foo
rbd snap create rbd/foo@snapname

regards
sadhu



-----Original Message-----
From: Indra Pramana [mailto:indra@sg.or.id] 
Sent: 08 October 2013 08:29
To: dev@cloudstack.apache.org; users@cloudstack.apache.org
Cc: Wido den Hollander
Subject: High CPU utilization on KVM hosts while doing RBD snapshot - was Re: snapshot caused host disconnected

Dear Wido and all,

I performed some further tests last night:

(1) CPU utilization of the KVM host while RBD snapshot running is still shooting up high even after I set global setting:
concurrent.snapshots.threshold.perhost to 2.

(2) Most of the concurrent snapshot processes will fail with either stuck in "Creating" state, or "CreatedOnPrimary" error message.

(3) I also have adjusted some other related global settings such as backup.snapshot.wait and job.expire.minutes, without any luck.

Any advise on the reason what causes the high CPU utilization is greatly appreciated.

Looking forward to your reply, thank you.

Cheers.


On Mon, Oct 7, 2013 at 11:03 PM, Indra Pramana <in...@sg.or.id> wrote:

> Dear all,
>
> I also found out that when the RBD snapshot is being run, the CPU 
> utilisation on the KVM host will be shooting up very high, which might 
> explain why the host becomes disconnected.
>
> top - 22:49:32 up 3 days, 19:31,  1 user,  load average: 7.85, 4.97, 3.47
> Tasks: 297 total,   3 running, 294 sleeping,   0 stopped,   0 zombie
> Cpu(s):  4.5%us,  1.2%sy,  0.0%ni, 94.1%id,  0.1%wa,  0.0%hi,  0.0%si, 
> 0.0%st
> Mem:  264125244k total, 77203460k used, 186921784k free,   154888k buffers
> Swap:   545788k total,        0k used,   545788k free, 60677092k cached
>
>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
> 18161 root      20   0 3871m  31m 8444 S  101  0.0 301:58.09 kvm
>  2790 root      20   0 43.5g 1.6g  19m S   97  0.7  45:52.42 jsvc
> 24544 root      20   0 4583m  31m 8364 S   97  0.0 425:29.48 kvm
>  6537 root      20   0     0    0    0 R   71  0.0   0:17.49 kworker/3:2
> 22546 root      20   0 6143m 2.0g 8452 S   26  0.8  55:14.07 kvm
>  4219 root      20   0 7671m 4.0g 8524 S    6  1.6 106:12.26 kvm
>  5989 root      20   0 43.2g 1.6g  232 D    6  0.6   0:08.13 jsvc
>  5993 root      20   0 43.3g 1.6g  224 D    6  0.6   0:08.36 jsvc
>
> Is it normal when snapshot is being run on the VM running on that 
> host, the host's CPU utilisation will be higher than usual? How can I 
> limit the CPU resources used by the snapshot?
>
>
> Looking forward to your reply, thank you.
>
> Cheers.
>
>
>
> On Mon, Oct 7, 2013 at 7:18 PM, Indra Pramana <in...@sg.or.id> wrote:
>
>> Dear all,
>>
>> I did some tests on snapshots since it's now supported for my Ceph 
>> RBD primary storage in CloudStack 4.2. When I ran the snapshot for a 
>> particular VM instance earlier, I noticed that this has caused the 
>> host (where the VM is on) becomes disconnected.
>>
>> Here's the excerpt from the agent.log:
>>
>> http://pastebin.com/dxVV7stu
>>
>> The management-server.log doesn't much showing anything other than 
>> detecting that the host was down and HA is being activated:
>>
>> http://pastebin.com/UeLiSm9K
>>
>> Anyone can advise what is causing the problem? So far there is only 
>> one user doing the snapshotting and it has caused issues to the host, 
>> I can't imagine what if multiple users try to do snapshotting at the same time?
>>
>> I read about snapshot job throttling which is described on the manual:
>>
>>
>> http://cloudstack.apache.org/docs/en-US/Apache_CloudStack/4.2.0/html/
>> Admin_Guide/working-with-snapshots.html
>>
>> But I am not too sure whether this will help to resolve the problem 
>> since there is only one user trying to perform snapshot and we 
>> already encounter the problem already.
>>
>> Anyone can advise how I can troubleshoot further and find a solution 
>> to the problem?
>>
>> Looking forward to your reply, thank you.
>>
>> Cheers.
>>
>
>

Re: High CPU utilization on KVM hosts while doing RBD snapshot - was Re: snapshot caused host disconnected

Posted by Indra Pramana <in...@sg.or.id>.
Hi Wido,

Good day to you, and thank you for your e-mail reply.

Thank you for your advice. We will try to tweak the code to throttle the
copying process to the secondary storage server.

Cheers.
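For illustration, a sleep-based throttle around a copy loop like the one in backupSnapshot() could look like the sketch below. This is only an illustration of the idea discussed in this thread; the copyThrottled() method, its rate-cap parameter, and the buffer reuse are assumptions of mine, not actual CloudStack code.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical sketch of a rate-limited copy loop, modelled on the
// chunked read/write loop in KVMStorageProcessor.backupSnapshot().
public class ThrottledCopy {

    // Copies in fixed-size chunks, sleeping whenever the average
    // throughput would exceed maxBytesPerSec. Returns bytes copied.
    public static long copyThrottled(InputStream in, OutputStream out,
                                     int chunkSize, long maxBytesPerSec) throws IOException {
        // Reuse one buffer; the original loop allocates a fresh 4 MiB
        // array on every iteration, which adds GC pressure.
        byte[] buf = new byte[chunkSize];
        long copied = 0;
        long start = System.nanoTime();
        while (true) {
            int bytes = in.read(buf, 0, chunkSize);
            if (bytes <= 0) {
                break;
            }
            out.write(buf, 0, bytes);
            copied += bytes;
            // How long the copy *should* have taken so far at the capped
            // rate; sleep off any time we are ahead of that schedule.
            long expectedNanos = (long) (copied * 1e9 / maxBytesPerSec);
            long aheadNanos = expectedNanos - (System.nanoTime() - start);
            if (aheadNanos > 0) {
                try {
                    Thread.sleep(aheadNanos / 1_000_000, (int) (aheadNanos % 1_000_000));
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    break;
                }
            }
        }
        return copied;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[1 << 20]; // 1 MiB of dummy snapshot data
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        long copied = copyThrottled(new ByteArrayInputStream(data), out,
                                    64 * 1024, 8L * 1024 * 1024); // cap at 8 MiB/s
        System.out.println("copied " + copied + " bytes");
    }
}
```

Capping the copy rate trades a longer backup window for lower sustained load on the host; it does not change how much data is read from the RBD image.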



On Thu, Oct 10, 2013 at 7:56 PM, Wido den Hollander <wi...@widodh.nl> wrote:

>
>
> On 10/10/2013 08:06 AM, Indra Pramana wrote:
>
>> Dear all,
>>
>> My colleague and I scrutinized the source code and found the code
>> that performs the copying of the snapshot from primary storage to
>> secondary storage, in this file:
>>
>> ./plugins/hypervisors/kvm/src/com/cloud/hypervisor/kvm/storage/KVMStorageProcessor.java
>>
>> Specifically under this function:
>>
>>      @Override
>>      public Answer backupSnapshot(CopyCommand cmd) {
>>
>> ===
>>                      File snapDir = new File(snapshotDestPath);
>>                      s_logger.debug("Attempting to create " +
>> snapDir.getAbsolutePath() + " recursively");
>>                      FileUtils.forceMkdir(snapDir);
>>
>>                      File snapFile = new File(snapshotDestPath + "/" +
>> snapshotName);
>>                      s_logger.debug("Backing up RBD snapshot to " +
>> snapFile.getAbsolutePath());
>>                      BufferedOutputStream bos = new
>> BufferedOutputStream(new
>> FileOutputStream(snapFile));
>>                      int chunkSize = 4194304;
>>                      long offset = 0;
>>                      while(true) {
>>                          byte[] buf = new byte[chunkSize];
>>
>>                          int bytes = image.read(offset, buf, chunkSize);
>>                          if (bytes <= 0) {
>>                              break;
>>                          }
>>                          bos.write(buf, 0, bytes);
>>                          offset += bytes;
>>                      }
>>                      s_logger.debug("Completed backing up RBD snapshot " +
>> snapshotName + " to  " + snapFile.getAbsolutePath() + ". Bytes written: "
>> +
>> offset);
>>                      bos.close();
>> ===
>>
>> (1) Is it safe to comment out the above lines and recompile/reinstall, to
>> prevent CloudStack from copying the snapshots from the RBD primary storage
>> to the secondary storage?
>>
>>
> No, since the management server will think it has been backed up to
> Secondary Storage, but it didn't.
>
>
>> (2) What would be the impact on CloudStack operations if we leave the
>> snapshots on primary storage without copying them to secondary storage?
>> Are we still able to restore from the snapshots kept in primary storage?
>>
>>
> Not sure, but that's what the broken setting is for.
>
> I don't think the problem here is the code; it's the hypervisor which
> isn't able to keep up.
>
> Some throttling might be useful here, but the code itself isn't broken as
> far as I can see.
>
> The setting to not back up directly to Secondary Storage
> should actually work, since that would mitigate your problem for now.
>
> Wido
>
>  Looking forward to your reply, thank you.
>>
>> Cheers.
>>
>>
>>
>> On Wed, Oct 9, 2013 at 2:36 PM, Indra Pramana <in...@sg.or.id> wrote:
>>
>>  Hi Wido and all,
>>>
>>> Good day to you, and thank you for your e-mail reply.
>>>
>>> Yes, from the agent logs I can see that the RBD snapshot was created.
>>> However, it seems that CPU utilisation goes up drastically during the
>>> copying over of the snapshot from primary storage to the secondary
>>> storage.
>>>
>>> ===
>>> 2013-10-08 00:01:58,765 DEBUG [cloud.agent.Agent]
>>> (agentRequest-Handler-5:null) Request:Seq 34-898172006:  { Cmd , MgmtId:
>>> 161342671900, via: 34, Ver: v1, Flags: 100011,
>>> [{"org.apache.cloudstack.storage.command.CreateObjectCommand":{"data":{"org.apache.cloudstack.storage.to.SnapshotObjectTO":{"volume":{"uuid":"0c4f8e41-dfd8-4fc2-a22e-1a79738560a1","volumeType":"DATADISK","dataStore":{"org.apache.cloudstack.storage.to.PrimaryDataStoreTO":{"uuid":"d433809b-01ea-3947-ba0f-48077244e4d6","id":214,"poolType":"RBD","host":"ceph-mon.simplercloud.com","path":"simplercloud-sg-01","port":6789}},"name":"DATA-2051","size":64424509440,"path":"fc5dfa05-2431-4b42-804b-b2fb72e219d0","volumeId":2289,"vmName":"i-195-2051-VM","accountId":195,"format":"RAW","id":2289,"hypervisorType":"KVM"},"parentSnapshotPath":"simplercloud-sg-01/fc5dfa05-2431-4b42-804b-b2fb72e219d0/61042668-23ab-4f63-8a21-ce5a24f9c883","dataStore":{"org.apache.cloudstack.storage.to.PrimaryDataStoreTO":{"uuid":"d433809b-01ea-3947-ba0f-48077244e4d6","id":214,"poolType":"RBD","host":"ceph-mon.simplercloud.com","path":"simplercloud-sg-01","port":6789}},"vmName":"i-195-2051-VM","name":"test-snapshot-and-ip-1_DATA-2051_20131007160158","hypervisorType":"KVM","id":22}},"wait":0}}]
>>> }
>>> 2013-10-08 00:02:08,071 DEBUG [cloud.agent.Agent]
>>> (agentRequest-Handler-1:null) Request:Seq 34-898172007:  { Cmd , MgmtId:
>>> 161342671900, via: 34, Ver: v1, Flags: 100011,
>>> [{"org.apache.cloudstack.storage.command.CreateObjectCommand":{"data":{"org.apache.cloudstack.storage.to.SnapshotObjectTO":{"volume":{"uuid":"35d9bae0-1683-4a3d-9a69-ccefa18bf899","volumeType":"DATADISK","dataStore":{"org.apache.cloudstack.storage.to.PrimaryDataStoreTO":{"uuid":"d433809b-01ea-3947-ba0f-48077244e4d6","id":214,"poolType":"RBD","host":"ceph-mon.simplercloud.com","path":"simplercloud-sg-01","port":6789}},"name":"DATA-2046","size":42949672960,"path":"59825284-6b60-4a37-b728-755b3752a755","volumeId":2278,"vmName":"i-190-2046-VM","accountId":190,"format":"RAW","id":2278,"hypervisorType":"KVM"},"dataStore":{"org.apache.cloudstack.storage.to.PrimaryDataStoreTO":{"uuid":"d433809b-01ea-3947-ba0f-48077244e4d6","id":214,"poolType":"RBD","host":"ceph-mon.simplercloud.com","path":"simplercloud-sg-01","port":6789}},"vmName":"i-190-2046-VM","name":"test-aft-upgrade-12-win_DATA-2046_20131007160207","hypervisorType":"KVM","id":23}},"wait":0}}]
>>> }
>>> ...
>>> 2013-10-08 00:02:08,191 DEBUG [kvm.storage.KVMStorageProcessor]
>>> (agentRequest-Handler-5:null) Succesfully connected to Ceph cluster at
>>> ceph-mon.simplercloud.com:6789
>>> 2013-10-08 00:02:08,214 DEBUG [kvm.storage.KVMStorageProcessor]
>>> (agentRequest-Handler-5:null) Attempting to create RBD snapshot
>>> fc5dfa05-2431-4b42-804b-b2fb72e219d0@d6df6e15-d2ec-46a6-8b75-1f94c127dbb8
>>> ...
>>> 2013-10-08 00:02:20,821 DEBUG [kvm.storage.KVMStorageProcessor]
>>> (agentRequest-Handler-1:null) Succesfully connected to Ceph cluster at
>>> ceph-mon.simplercloud.com:6789
>>> ...
>>> 2013-10-08 00:05:19,580 DEBUG [kvm.storage.KVMStorageProcessor]
>>> (agentRequest-Handler-3:null) Succesfully connected to Ceph cluster at
>>> ceph-mon.simplercloud.com:6789
>>> 2013-10-08 00:05:19,610 DEBUG [kvm.storage.KVMStorageProcessor]
>>> (agentRequest-Handler-3:null) Attempting to create
>>> /mnt/c10404d3-070e-3579-980e-cb0d40effb7b/snapshots/190/2277 recursively
>>> 2013-10-08 00:05:20,645 DEBUG [kvm.storage.KVMStorageProcessor]
>>> (agentRequest-Handler-3:null) Backing up RBD snapshot to
>>> /mnt/c10404d3-070e-3579-980e-cb0d40effb7b/snapshots/190/2277/5c9ea7c4-03e1-455e-a785-3c96df68cf69
>>> ===
>>>
>>> After that, depending on the size of the snapshot being backed up, the KVM
>>> host will suffer high CPU utilisation and will sometimes time out and
>>> become disconnected from the management server. Then HA will kick in and
>>> everything goes haywire.
>>>
>>> This is what I have done:
>>>
>>> (1) Enabled jumbo frames (MTU 9000) on the NICs and on the switch ports
>>> to which the KVM hosts and the secondary storage server are connected. We
>>> are using 10 Gbps NICs and switches.
>>>
>>> (2) Performed some fine-tuning on the KVM hosts' /etc/sysctl.conf to
>>> improve the network performance:
>>>
>>> net.ipv4.tcp_wmem = 4096 65536 16777216
>>> net.ipv4.tcp_rmem = 4096 87380 16777216
>>> net.core.wmem_max = 16777216
>>> net.core.rmem_max = 16777216
>>> net.core.wmem_default = 65536
>>> net.core.rmem_default = 87380
>>> net.core.netdev_max_backlog = 30000
>>>
>>> (3) Tried to disable RBD snapshot copying from primary to secondary storage
>>> by setting this global CloudStack configuration setting:
>>>
>>> snapshot.backup.rightafter = false
>>>
>>> However, when I tested it, the snapshotting process still tries to copy
>>> the snapshot over to the secondary storage. I checked another
>>> thread and it seems to be a bug in 4.2.0.
>>>
>>> Is it actually necessary for the snapshots to be backed up to secondary
>>> storage, or is it OK for them to stay in primary storage? It seems
>>> that copying the snapshot from primary to secondary
>>> storage is the root cause of the high CPU utilisation on the KVM hosts.
>>>
>>> Due to the above bug in point (3), is there a workaround to prevent the
>>> snapshots from being copied to the secondary storage after they are
>>> created on the primary storage?
>>>
>>> Looking forward to your reply, thank you.
>>>
>>> Cheers.
>>>
>>>
>>>
>>> On Tue, Oct 8, 2013 at 2:41 PM, Wido den Hollander <wi...@widodh.nl>
>>> wrote:
>>>
>>>
>>>>
>>>> On 10/08/2013 04:59 AM, Indra Pramana wrote:
>>>>
>>>>  Dear Wido and all,
>>>>>
>>>>> I performed some further tests last night:
>>>>>
>>>>> (1) CPU utilization of the KVM host while RBD snapshot running is still
>>>>> shooting up high even after I set global setting:
>>>>> concurrent.snapshots.threshold.perhost to 2.
>>>>>
>>>>>
>>>>> (2) Most of the concurrent snapshot processes will fail with either
>>>>> stuck
>>>>> in "Creating" state, or "CreatedOnPrimary" error message.
>>>>>
>>>>>
>>>>>  Hmm, that is odd. It uses rados-java to call the RBD library to create
>>>> the snapshot and afterwards it copies it to Secondary Storage.
>>>>
>>>> I'm leaving for the Ceph Days and the Build a Cloud Day afterwards in
>>>> London now, so I won't be able to look at this the coming 2 days.
>>>>
>>>>
>>>>   (3) I also have adjusted some other related global settings such as
>>>>
>>>>> backup.snapshot.wait and job.expire.minutes, without any luck.
>>>>>
>>>>> Any advise on the reason what causes the high CPU utilization is
>>>>> greatly
>>>>> appreciated.
>>>>>
>>>>>
>>>> You might want to set the Agent log to debug and see if the RBD snapshot
>>>> was created; it should log that:
>>>>
>>>> https://git-wip-us.apache.org/repos/asf?p=cloudstack.git;a=blob;f=plugins/hypervisors/kvm/src/com/cloud/hypervisor/kvm/storage/KVMStorageProcessor.java;h=1b883519073acc7514b66857e080a464714c4324;hb=4.2#l1091
>>>>
>>>>
>>>> "Attempting to create RBD snapshot"
>>>>
>>>> If that succeeds the problem lies with backing up the snapshot to
>>>> Secondary Storage.
>>>>
>>>> Wido
>>>>
>>>>
>>>>   Looking forward to your reply, thank you.
>>>>
>>>>>
>>>>> Cheers.
>>>>>
>>>>>
>>>>> On Mon, Oct 7, 2013 at 11:03 PM, Indra Pramana <in...@sg.or.id> wrote:
>>>>>
>>>>>   Dear all,
>>>>>
>>>>>>
>>>>>> I also found out that when the RBD snapshot is being run, the CPU
>>>>>> utilisation on the KVM host will be shooting up very high, which might
>>>>>> explain why the host becomes disconnected.
>>>>>>
>>>>>> top - 22:49:32 up 3 days, 19:31,  1 user,  load average: 7.85, 4.97,
>>>>>> 3.47
>>>>>> Tasks: 297 total,   3 running, 294 sleeping,   0 stopped,   0 zombie
>>>>>> Cpu(s):  4.5%us,  1.2%sy,  0.0%ni, 94.1%id,  0.1%wa,  0.0%hi,  0.0%si,
>>>>>> 0.0%st
>>>>>> Mem:  264125244k total, 77203460k used, 186921784k free,   154888k
>>>>>> buffers
>>>>>> Swap:   545788k total,        0k used,   545788k free, 60677092k
>>>>>> cached
>>>>>>
>>>>>>     PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>>>>>> 18161 root      20   0 3871m  31m 8444 S  101  0.0 301:58.09 kvm
>>>>>>    2790 root      20   0 43.5g 1.6g  19m S   97  0.7  45:52.42 jsvc
>>>>>> 24544 root      20   0 4583m  31m 8364 S   97  0.0 425:29.48 kvm
>>>>>>    6537 root      20   0     0    0    0 R   71  0.0   0:17.49
>>>>>> kworker/3:2
>>>>>> 22546 root      20   0 6143m 2.0g 8452 S   26  0.8  55:14.07 kvm
>>>>>>    4219 root      20   0 7671m 4.0g 8524 S    6  1.6 106:12.26 kvm
>>>>>>    5989 root      20   0 43.2g 1.6g  232 D    6  0.6   0:08.13 jsvc
>>>>>>    5993 root      20   0 43.3g 1.6g  224 D    6  0.6   0:08.36 jsvc
>>>>>>
>>>>>> Is it normal when snapshot is being run on the VM running on that
>>>>>> host,
>>>>>> the host's CPU utilisation will be higher than usual? How can I limit
>>>>>> the
>>>>>> CPU resources used by the snapshot?
>>>>>>
>>>>>>
>>>>>> Looking forward to your reply, thank you.
>>>>>>
>>>>>> Cheers.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Mon, Oct 7, 2013 at 7:18 PM, Indra Pramana <in...@sg.or.id> wrote:
>>>>>>
>>>>>>   Dear all,
>>>>>>
>>>>>>>
>>>>>>> I did some tests on snapshots since it's now supported for my Ceph
>>>>>>> RBD
>>>>>>> primary storage in CloudStack 4.2. When I ran the snapshot for a
>>>>>>> particular
>>>>>>> VM instance earlier, I noticed that this has caused the host (where
>>>>>>> the VM
>>>>>>> is on) becomes disconnected.
>>>>>>>
>>>>>>> Here's the excerpt from the agent.log:
>>>>>>>
>>>>>>> http://pastebin.com/dxVV7stu
>>>>>>>
>>>>>>> The management-server.log doesn't much showing anything other than
>>>>>>> detecting that the host was down and HA is being activated:
>>>>>>>
>>>>>>> http://pastebin.com/UeLiSm9K
>>>>>>>
>>>>>>> Anyone can advise what is causing the problem? So far there is only
>>>>>>> one
>>>>>>> user doing the snapshotting and it has caused issues to the host, I
>>>>>>> can't
>>>>>>> imagine what if multiple users try to do snapshotting at the same
>>>>>>> time?
>>>>>>>
>>>>>>> I read about snapshot job throttling which is described on the
>>>>>>> manual:
>>>>>>>
>>>>>>>
>>>>>>> http://cloudstack.apache.org/docs/en-US/Apache_CloudStack/4.2.0/html/Admin_Guide/working-with-snapshots.html
>>>>>>>
>>>>>>>
>>>>>>> But I am not too sure whether this will help to resolve the problem
>>>>>>> since
>>>>>>> there is only one user trying to perform snapshot and we already
>>>>>>> encounter
>>>>>>> the problem already.
>>>>>>>
>>>>>>> Anyone can advise how I can troubleshoot further and find a solution
>>>>>>> to
>>>>>>> the problem?
>>>>>>>
>>>>>>> Looking forward to your reply, thank you.
>>>>>>>
>>>>>>> Cheers.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>
>>

Re: High CPU utilization on KVM hosts while doing RBD snapshot - was Re: snapshot caused host disconnected

Posted by Wido den Hollander <wi...@widodh.nl>.

On 10/10/2013 08:06 AM, Indra Pramana wrote:
> Dear all,
>
> I and my colleague tried to scrutinize the source code and found the script
> which is performing the copying of the snapshot from primary storage to
> secondary storage on this file:
>
> ./plugins/hypervisors/kvm/src/com/cloud/hypervisor/kvm/storage/KVMStorageProcessor.java
>
> Specifically under this function:
>
>      @Override
>      public Answer backupSnapshot(CopyCommand cmd) {
>
> ===
>                      File snapDir = new File(snapshotDestPath);
>                      s_logger.debug("Attempting to create " +
> snapDir.getAbsolutePath() + " recursively");
>                      FileUtils.forceMkdir(snapDir);
>
>                      File snapFile = new File(snapshotDestPath + "/" +
> snapshotName);
>                      s_logger.debug("Backing up RBD snapshot to " +
> snapFile.getAbsolutePath());
>                      BufferedOutputStream bos = new BufferedOutputStream(new
> FileOutputStream(snapFile));
>                      int chunkSize = 4194304;
>                      long offset = 0;
>                      while(true) {
>                          byte[] buf = new byte[chunkSize];
>
>                          int bytes = image.read(offset, buf, chunkSize);
>                          if (bytes <= 0) {
>                              break;
>                          }
>                          bos.write(buf, 0, bytes);
>                          offset += bytes;
>                      }
>                      s_logger.debug("Completed backing up RBD snapshot " +
> snapshotName + " to  " + snapFile.getAbsolutePath() + ". Bytes written: " +
> offset);
>                      bos.close();
> ===
>
> (1) Is it safe to comment out the above lines and recompile/reinstall, to
> prevent CloudStack from copying the snapshots from the RBD primary storage
> to the secondary storage?
>

No, since the management server will think it has been backed up to
Secondary Storage, but it wasn't.

> (2) What would be the impact to CloudStack operations if we leave the
> snapshots on primary storage without copying them to secondary storage? Are
> we still able to do restoration from the snapshots kept in the primary
> storage?
>

Not sure, but that's what the (currently broken) setting is for.

I don't think the problem here is the code; it's the hypervisor which
isn't able to keep up.

Some throttling might be useful here, but the code itself isn't broken
as far as I can see.

The setting to not back up directly to Secondary Storage should
actually work, since that would mitigate your problem for now.

Wido
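[Editor's note: the backupSnapshot loop quoted earlier in this thread allocates a fresh 4 MiB buffer on every iteration and copies at full speed. As a sketch of the kind of throttling discussed here — not CloudStack's actual code; the class name and the maxBytesPerSec knob are invented for illustration — a copy loop can reuse one buffer and sleep to cap throughput:]

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

class ThrottledCopy {

    // Copies 'in' to 'out' in fixed-size chunks, reusing a single buffer
    // (the quoted loop allocates a new 4 MiB array per chunk) and sleeping
    // so the average rate stays under maxBytesPerSec.
    static long copy(InputStream in, OutputStream out,
                     int chunkSize, long maxBytesPerSec) throws IOException {
        byte[] buf = new byte[chunkSize]; // allocated once, not per chunk
        long total = 0;
        final long start = System.nanoTime();
        int n;
        while ((n = in.read(buf, 0, chunkSize)) > 0) {
            out.write(buf, 0, n);
            total += n;
            // How long the copy *should* have taken so far at the capped
            // rate; sleep off any surplus so we never run ahead of it.
            long dueNanos = total * 1_000_000_000L / maxBytesPerSec;
            long aheadNanos = dueNanos - (System.nanoTime() - start);
            if (aheadNanos > 0) {
                try {
                    Thread.sleep(aheadNanos / 1_000_000L,
                                 (int) (aheadNanos % 1_000_000L));
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IOException("copy interrupted", e);
                }
            }
        }
        out.flush();
        return total;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[256 * 1024]; // 256 KiB of zeroes as test input
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        long written = copy(new ByteArrayInputStream(data), sink,
                            64 * 1024, 8L * 1024 * 1024); // cap ~8 MiB/s
        System.out.println(written == data.length && sink.size() == data.length);
        // prints: true
    }
}
```

This only bounds I/O pressure on the agent thread; it does not change what the management server expects, so it is a mitigation sketch rather than a fix for the backup path itself.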

> Looking forward to your reply, thank you.
>
> Cheers.
>
>
>
> On Wed, Oct 9, 2013 at 2:36 PM, Indra Pramana <in...@sg.or.id> wrote:
>
>> Hi Wido and all,
>>
>> Good day to you, and thank you for your e-mail reply.
>>
>> Yes, from the agent logs I can see that the RBD snapshot was created.
>> However, it seems that CPU utilisation goes up drastically during the
>> copying over of the snapshot from primary storage to the secondary storage.
>>
>> ===
>> 2013-10-08 00:01:58,765 DEBUG [cloud.agent.Agent]
>> (agentRequest-Handler-5:null) Request:Seq 34-898172006:  { Cmd , MgmtId:
>> 161342671900, via: 34, Ver: v1, Flags: 100011,
>> [{"org.apache.cloudstack.storage.command.CreateObjectCommand":{"data":{"org.apache.cloudstack.storage.to.SnapshotObjectTO":{"volume":{"uuid":"0c4f8e41-dfd8-4fc2-a22e-1a79738560a1","volumeType":"DATADISK","dataStore":{"org.apache.cloudstack.storage.to.PrimaryDataStoreTO":{"uuid":"d433809b-01ea-3947-ba0f-48077244e4d6","id":214,"poolType":"RBD","host":"
>> ceph-mon.simplercloud.com
>> ","path":"simplercloud-sg-01","port":6789}},"name":"DATA-2051","size":64424509440,"path":"fc5dfa05-2431-4b42-804b-b2fb72e219d0","volumeId":2289,"vmName":"i-195-2051-VM","accountId":195,"format":"RAW","id":2289,"hypervisorType":"KVM"},"parentSnapshotPath":"simplercloud-sg-01/fc5dfa05-2431-4b42-804b-b2fb72e219d0/61042668-23ab-4f63-8a21-ce5a24f9c883","dataStore":{"org.apache.cloudstack.storage.to.PrimaryDataStoreTO":{"uuid":"d433809b-01ea-3947-ba0f-48077244e4d6","id":214,"poolType":"RBD","host":"
>> ceph-mon.simplercloud.com","path":"simplercloud-sg-01","port":6789}},"vmName":"i-195-2051-VM","name":"test-snapshot-and-ip-1_DATA-2051_20131007160158","hypervisorType":"KVM","id":22}},"wait":0}}]
>> }
>> 2013-10-08 00:01:58,765 DEBUG [cloud.agent.Agent]
>> (agentRequest-Handler-5:null) Processing command:
>> org.apache.cloudstack.storage.command.CreateObjectCommand
>> 2013-10-08 00:02:08,071 DEBUG [cloud.agent.Agent]
>> (agentRequest-Handler-1:null) Request:Seq 34-898172007:  { Cmd , MgmtId:
>> 161342671900, via: 34, Ver: v1, Flags: 100011,
>> [{"org.apache.cloudstack.storage.command.CreateObjectCommand":{"data":{"org.apache.cloudstack.storage.to.SnapshotObjectTO":{"volume":{"uuid":"35d9bae0-1683-4a3d-9a69-ccefa18bf899","volumeType":"DATADISK","dataStore":{"org.apache.cloudstack.storage.to.PrimaryDataStoreTO":{"uuid":"d433809b-01ea-3947-ba0f-48077244e4d6","id":214,"poolType":"RBD","host":"
>> ceph-mon.simplercloud.com
>> ","path":"simplercloud-sg-01","port":6789}},"name":"DATA-2046","size":42949672960,"path":"59825284-6b60-4a37-b728-755b3752a755","volumeId":2278,"vmName":"i-190-2046-VM","accountId":190,"format":"RAW","id":2278,"hypervisorType":"KVM"},"dataStore":{"org.apache.cloudstack.storage.to.PrimaryDataStoreTO":{"uuid":"d433809b-01ea-3947-ba0f-48077244e4d6","id":214,"poolType":"RBD","host":"
>> ceph-mon.simplercloud.com","path":"simplercloud-sg-01","port":6789}},"vmName":"i-190-2046-VM","name":"test-aft-upgrade-12-win_DATA-2046_20131007160207","hypervisorType":"KVM","id":23}},"wait":0}}]
>> }
>> ...
>> 2013-10-08 00:02:08,191 DEBUG [kvm.storage.KVMStorageProcessor]
>> (agentRequest-Handler-5:null) Succesfully connected to Ceph cluster at
>> ceph-mon.simplercloud.com:6789
>> 2013-10-08 00:02:08,214 DEBUG [kvm.storage.KVMStorageProcessor]
>> (agentRequest-Handler-5:null) Attempting to create RBD snapshot
>> fc5dfa05-2431-4b42-804b-b2fb72e219d0@d6df6e15-d2ec-46a6-8b75-1f94c127dbb8
>> ...
>> 2013-10-08 00:02:20,821 DEBUG [kvm.storage.KVMStorageProcessor]
>> (agentRequest-Handler-1:null) Succesfully connected to Ceph cluster at
>> ceph-mon.simplercloud.com:6789
>> ...
>> 2013-10-08 00:05:19,580 DEBUG [kvm.storage.KVMStorageProcessor]
>> (agentRequest-Handler-3:null) Succesfully connected to Ceph cluster at
>> ceph-mon.simplercloud.com:6789
>> 2013-10-08 00:05:19,610 DEBUG [kvm.storage.KVMStorageProcessor]
>> (agentRequest-Handler-3:null) Attempting to create
>> /mnt/c10404d3-070e-3579-980e-cb0d40effb7b/snapshots/190/2277 recursively
>> 2013-10-08 00:05:20,645 DEBUG [kvm.storage.KVMStorageProcessor]
>> (agentRequest-Handler-3:null) Backing up RBD snapshot to
>> /mnt/c10404d3-070e-3579-980e-cb0d40effb7b/snapshots/190/2277/5c9ea7c4-03e1-455e-a785-3c96df68cf69
>> ===
>>
>> After that, depending on the size of the snapshot to be backed up, the KVM
>> host will suffer from high CPU utilisation and will sometimes time out and
>> be disconnected from the management server. Then HA will kick in and
>> everything goes haywire.
>>
>> This is what I have done:
>>
>> (1) Enabling jumbo frames (MTU 9000) on the NIC cards and the switch ports
>> where the KVM hosts and the secondary storage server are connected to. We
>> are using 10 GBps NIC cards and switches.
>>
>> (2) Performed some fine-tuning on the KVM hosts' /etc/sysctl.conf to
>> improve the network performance:
>>
>> net.ipv4.tcp_wmem = 4096 65536 16777216
>> net.ipv4.tcp_rmem = 4096 87380 16777216
>> net.core.wmem_max = 16777216
>> net.core.rmem_max = 16777216
>> net.core.wmem_default = 65536
>> net.core.rmem_default = 87380
>> net.core.netdev_max_backlog = 30000
>>
>> (3) Try to disable RBD snapshot copying from primary to secondary storage
>> by setting this global Cloudstack configuration:
>>
>> snapshot.backup.rightafter = false
>>
>> However, when I tested, the snapshotting process still tried to copy
>> the snapshot over to the secondary storage. I checked another
>> thread and it seems to be a bug in 4.2.0.
>>
>> Is it actually necessary for the snapshots to be backed up to secondary
>> storage, or is it OK for the snapshots to stay on primary storage? It seems
>> that the process of copying the snapshot from primary to secondary
>> storage is the root cause of the high CPU utilisation on the KVM hosts.
>>
>> Due to the bug in point (3) above, is there a workaround to
>> prevent the snapshots from being copied to the secondary storage after
>> they are created on the primary storage?
>>
>> Looking forward to your reply, thank you.
>>
>> Cheers.
>>
>>
>>
>> On Tue, Oct 8, 2013 at 2:41 PM, Wido den Hollander <wi...@widodh.nl> wrote:
>>
>>>
>>>
>>> On 10/08/2013 04:59 AM, Indra Pramana wrote:
>>>
>>>> Dear Wido and all,
>>>>
>>>> I performed some further tests last night:
>>>>
>>>> (1) CPU utilization of the KVM host while RBD snapshot running is still
>>>> shooting up high even after I set global setting:
>>>> concurrent.snapshots.threshold.perhost to 2.
>>>>
>>>> (2) Most of the concurrent snapshot processes will fail with either stuck
>>>> in "Creating" state, or "CreatedOnPrimary" error message.
>>>>
>>>>
>>> Hmm, that is odd. It uses rados-java to call the RBD library to create
>>> the snapshot and afterwards it copies it to Secondary Storage.
>>>
>>> I'm leaving for the Ceph Days and the Build a Cloud Day afterwards in
>>> London now, so I won't be able to look at this the coming 2 days.
>>>
>>>
>>>   (3) I also have adjusted some other related global settings such as
>>>> backup.snapshot.wait and job.expire.minutes, without any luck.
>>>>
>>>> Any advice on what causes the high CPU utilization is greatly
>>>> appreciated.
>>>>
>>>>
>>> You might want to set the Agent log to debug and see if the RBD snapshot
>>> was created, it should log that:
>>> https://git-wip-us.apache.org/repos/asf?p=cloudstack.git;a=blob;f=plugins/hypervisors/kvm/src/com/cloud/hypervisor/kvm/storage/KVMStorageProcessor.java;h=1b883519073acc7514b66857e080a464714c4324;hb=4.2#l1091
>>>
>>> "Attempting to create RBD snapshot"
>>>
>>> If that succeeds the problem lies with backing up the snapshot to
>>> Secondary Storage.
>>>
>>> Wido
>>>
>>>
>>>   Looking forward to your reply, thank you.
>>>>
>>>> Cheers.
>>>>
>>>>
>>>> On Mon, Oct 7, 2013 at 11:03 PM, Indra Pramana <in...@sg.or.id> wrote:
>>>>
>>>>   Dear all,
>>>>>
>>>>> I also found out that when the RBD snapshot is being run, the CPU
>>>>> utilisation on the KVM host will be shooting up very high, which might
>>>>> explain why the host becomes disconnected.
>>>>>
>>>>> top - 22:49:32 up 3 days, 19:31,  1 user,  load average: 7.85, 4.97,
>>>>> 3.47
>>>>> Tasks: 297 total,   3 running, 294 sleeping,   0 stopped,   0 zombie
>>>>> Cpu(s):  4.5%us,  1.2%sy,  0.0%ni, 94.1%id,  0.1%wa,  0.0%hi,  0.0%si,
>>>>> 0.0%st
>>>>> Mem:  264125244k total, 77203460k used, 186921784k free,   154888k
>>>>> buffers
>>>>> Swap:   545788k total,        0k used,   545788k free, 60677092k cached
>>>>>
>>>>>     PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>>>>> 18161 root      20   0 3871m  31m 8444 S  101  0.0 301:58.09 kvm
>>>>>    2790 root      20   0 43.5g 1.6g  19m S   97  0.7  45:52.42 jsvc
>>>>> 24544 root      20   0 4583m  31m 8364 S   97  0.0 425:29.48 kvm
>>>>>    6537 root      20   0     0    0    0 R   71  0.0   0:17.49
>>>>> kworker/3:2
>>>>> 22546 root      20   0 6143m 2.0g 8452 S   26  0.8  55:14.07 kvm
>>>>>    4219 root      20   0 7671m 4.0g 8524 S    6  1.6 106:12.26 kvm
>>>>>    5989 root      20   0 43.2g 1.6g  232 D    6  0.6   0:08.13 jsvc
>>>>>    5993 root      20   0 43.3g 1.6g  224 D    6  0.6   0:08.36 jsvc
>>>>>
>>>>> Is it normal that, when a snapshot is run on a VM on that host, the
>>>>> host's CPU utilisation is higher than usual? How can I limit the
>>>>> CPU resources used by the snapshot?
>>>>>
>>>>>
>>>>> Looking forward to your reply, thank you.
>>>>>
>>>>> Cheers.
>>>>>
>>>>>
>>>>>
>>>>> On Mon, Oct 7, 2013 at 7:18 PM, Indra Pramana <in...@sg.or.id> wrote:
>>>>>
>>>>>   Dear all,
>>>>>>
>>>>>> I did some tests on snapshots since it's now supported for my Ceph RBD
>>>>>> primary storage in CloudStack 4.2. When I ran the snapshot for a
>>>>>> particular
>>>>>> VM instance earlier, I noticed that this caused the host (where the
>>>>>> VM runs) to become disconnected.
>>>>>>
>>>>>> Here's the excerpt from the agent.log:
>>>>>>
>>>>>> http://pastebin.com/dxVV7stu
>>>>>>
>>>>>> The management-server.log doesn't show much other than
>>>>>> detecting that the host was down and that HA was being activated:
>>>>>>
>>>>>> http://pastebin.com/UeLiSm9K
>>>>>>
>>>>>> Can anyone advise what is causing the problem? So far only one user is
>>>>>> doing snapshotting and it has already caused issues on the host; I
>>>>>> can't imagine what would happen if multiple users tried to snapshot at
>>>>>> the same time.
>>>>>>
>>>>>> I read about snapshot job throttling which is described on the manual:
>>>>>>
>>>>>>
>>>>>> http://cloudstack.apache.org/docs/en-US/Apache_CloudStack/4.2.0/html/Admin_Guide/working-with-snapshots.html
>>>>>>
>>>>>> But I am not sure whether this will help resolve the problem, since
>>>>>> there is only one user performing snapshots and we already encounter
>>>>>> the problem.
>>>>>>
>>>>>> Can anyone advise how I can troubleshoot further and find a solution to
>>>>>> the problem?
>>>>>>
>>>>>> Looking forward to your reply, thank you.
>>>>>>
>>>>>> Cheers.
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>
>

RE: High CPU utilization on KVM hosts while doing RBD snapshot - was Re: snapshot caused host disconnected

Posted by Edison Su <Ed...@citrix.com>.

> -----Original Message-----
> From: Indra Pramana [mailto:indra@sg.or.id]
> Sent: Wednesday, October 09, 2013 11:07 PM
> To: dev@cloudstack.apache.org; users@cloudstack.apache.org
> Subject: Re: High CPU utilization on KVM hosts while doing RBD snapshot -
> was Re: snapshot caused host disconnected
> 
> Dear all,
> 
> I and my colleague tried to scrutinize the source code and found the script
> which is performing the copying of the snapshot from primary storage to
> secondary storage on this file:
> 
> ./plugins/hypervisors/kvm/src/com/cloud/hypervisor/kvm/storage/KVMStorageProcessor.java
> 
> Specifically under this function:
> 
>     @Override
>     public Answer backupSnapshot(CopyCommand cmd) {
> 
> ===
>                     File snapDir = new File(snapshotDestPath);
>                     s_logger.debug("Attempting to create " +
> snapDir.getAbsolutePath() + " recursively");
>                     FileUtils.forceMkdir(snapDir);
> 
>                     File snapFile = new File(snapshotDestPath + "/" + snapshotName);
>                     s_logger.debug("Backing up RBD snapshot to " +
> snapFile.getAbsolutePath());
>                     BufferedOutputStream bos = new BufferedOutputStream(new
> FileOutputStream(snapFile));
>                     int chunkSize = 4194304;
>                     long offset = 0;
>                     while(true) {
>                         byte[] buf = new byte[chunkSize];
> 
>                         int bytes = image.read(offset, buf, chunkSize);
>                         if (bytes <= 0) {
>                             break;
>                         }
>                         bos.write(buf, 0, bytes);
>                         offset += bytes;
>                     }
>                     s_logger.debug("Completed backing up RBD snapshot " +
> snapshotName + " to  " + snapFile.getAbsolutePath() + ". Bytes written: " +
> offset);
>                     bos.close();
> ===
> 
> (1) Is it safe to comment out the above lines and recompile/reinstall, to
> prevent CloudStack from copying the snapshots from the RBD primary
> storage to the secondary storage?
> 
> (2) What would be the impact to CloudStack operations if we leave the
> snapshots on primary storage without copying them to secondary storage?
> Are we still able to do restoration from the snapshots kept in the primary
> storage?

Yes, it's doable, but you need to write a SnapshotStrategy in the management server. The current SnapshotStrategy always backs up the snapshot to secondary storage after taking it.
You can write a new SnapshotStrategy that just takes the snapshot without backing it up to secondary storage.
After commit 180cfa19e87b909cb1c8a738359e31a6111b11c5 is checked into master, you will get a lot of freedom to manipulate snapshots.
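[Editor's note: the suggestion above can be pictured with a small strategy-pattern sketch. To be clear, this is not CloudStack's real SnapshotStrategy interface; every name below — the interface, both classes, and the backupRightAfter flag echoing the snapshot.backup.rightafter setting — is invented for illustration.]

```java
// Illustration only: a strategy-pattern sketch of the suggestion above.
// None of these names match CloudStack's actual SnapshotStrategy interface.
class SnapshotStrategyDemo {

    interface SnapshotStrategy {
        // Returns a description of where the snapshot ends up.
        String takeSnapshot(String volumePath, String snapshotName);
    }

    // Behaviour described in this thread: snapshot on primary,
    // then immediately stream a copy to secondary storage.
    static class BackupToSecondaryStrategy implements SnapshotStrategy {
        public String takeSnapshot(String volumePath, String snapshotName) {
            String snap = volumePath + "@" + snapshotName; // RBD-style snap spec
            // ... create snapshot on primary, then copy it to secondary ...
            return snap + " backed up to secondary";
        }
    }

    // The alternative: take the snapshot and leave it on primary storage.
    static class PrimaryOnlyStrategy implements SnapshotStrategy {
        public String takeSnapshot(String volumePath, String snapshotName) {
            String snap = volumePath + "@" + snapshotName;
            // ... create snapshot on primary; deliberately skip the backup ...
            return snap + " kept on primary";
        }
    }

    // Hypothetical selector mirroring the snapshot.backup.rightafter flag.
    static SnapshotStrategy pick(boolean backupRightAfter) {
        return backupRightAfter ? new BackupToSecondaryStrategy()
                                : new PrimaryOnlyStrategy();
    }

    public static void main(String[] args) {
        System.out.println(pick(false).takeSnapshot("pool/vol-2289", "snap-22"));
        // prints: pool/vol-2289@snap-22 kept on primary
    }
}
```

The point of the pattern is that the management server could choose the primary-only behaviour per request without touching the agent-side copy loop quoted above.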

> 
> Looking forward to your reply, thank you.
> 
> Cheers.
> 
> 
> 
> On Wed, Oct 9, 2013 at 2:36 PM, Indra Pramana <in...@sg.or.id> wrote:
> 
> > Hi Wido and all,
> >
> > Good day to you, and thank you for your e-mail reply.
> >
> > Yes, from the agent logs I can see that the RBD snapshot was created.
> > However, it seems that CPU utilisation goes up drastically during the
> > copying over of the snapshot from primary storage to the secondary
> storage.
> >
> > ===
> > 2013-10-08 00:01:58,765 DEBUG [cloud.agent.Agent]
> > (agentRequest-Handler-5:null) Request:Seq 34-898172006:  { Cmd , MgmtId:
> > 161342671900, via: 34, Ver: v1, Flags: 100011,
> >
> [{"org.apache.cloudstack.storage.command.CreateObjectCommand":{"data"
> :{"org.apache.cloudstack.storage.to.SnapshotObjectTO":{"volume":{"uuid":"
> 0c4f8e41-dfd8-4fc2-a22e-
> 1a79738560a1","volumeType":"DATADISK","dataStore":{"org.apache.cloudst
> ack.storage.to.PrimaryDataStoreTO":{"uuid":"d433809b-01ea-3947-ba0f-
> 48077244e4d6","id":214,"poolType":"RBD","host":"
> > ceph-mon.simplercloud.com
> > ","path":"simplercloud-sg-01","port":6789}},"name":"DATA-
> 2051","size":64424509440,"path":"fc5dfa05-2431-4b42-804b-
> b2fb72e219d0","volumeId":2289,"vmName":"i-195-2051-
> VM","accountId":195,"format":"RAW","id":2289,"hypervisorType":"KVM"},"
> parentSnapshotPath":"simplercloud-sg-01/fc5dfa05-2431-4b42-804b-
> b2fb72e219d0/61042668-23ab-4f63-8a21-
> ce5a24f9c883","dataStore":{"org.apache.cloudstack.storage.to.PrimaryDataS
> toreTO":{"uuid":"d433809b-01ea-3947-ba0f-
> 48077244e4d6","id":214,"poolType":"RBD","host":"
> > ceph-mon.simplercloud.com","path":"simplercloud-sg-01","port":6789}},"
> > vmName":"i-195-2051-VM","name":"test-snapshot-and-ip-1_DATA-
> 2051_20131
> > 007160158","hypervisorType":"KVM","id":22}},"wait":0}}]
> > }
> > 2013-10-08 00:01:58,765 DEBUG [cloud.agent.Agent]
> > (agentRequest-Handler-5:null) Processing command:
> > org.apache.cloudstack.storage.command.CreateObjectCommand
> > 2013-10-08 00:02:08,071 DEBUG [cloud.agent.Agent]
> > (agentRequest-Handler-1:null) Request:Seq 34-898172007:  { Cmd , MgmtId:
> > 161342671900, via: 34, Ver: v1, Flags: 100011,
> >
> [{"org.apache.cloudstack.storage.command.CreateObjectCommand":{"data"
> :{"org.apache.cloudstack.storage.to.SnapshotObjectTO":{"volume":{"uuid":"
> 35d9bae0-1683-4a3d-9a69-
> ccefa18bf899","volumeType":"DATADISK","dataStore":{"org.apache.cloudsta
> ck.storage.to.PrimaryDataStoreTO":{"uuid":"d433809b-01ea-3947-ba0f-
> 48077244e4d6","id":214,"poolType":"RBD","host":"
> > ceph-mon.simplercloud.com
> > ","path":"simplercloud-sg-01","port":6789}},"name":"DATA-
> 2046","size":42949672960,"path":"59825284-6b60-4a37-b728-
> 755b3752a755","volumeId":2278,"vmName":"i-190-2046-
> VM","accountId":190,"format":"RAW","id":2278,"hypervisorType":"KVM"},"
> dataStore":{"org.apache.cloudstack.storage.to.PrimaryDataStoreTO":{"uuid"
> :"d433809b-01ea-3947-ba0f-
> 48077244e4d6","id":214,"poolType":"RBD","host":"
> > ceph-mon.simplercloud.com","path":"simplercloud-sg-01","port":6789}},"
> > vmName":"i-190-2046-VM","name":"test-aft-upgrade-12-win_DATA-
> 2046_2013
> > 1007160207","hypervisorType":"KVM","id":23}},"wait":0}}]
> > }
> > ...
> > 2013-10-08 00:02:08,191 DEBUG [kvm.storage.KVMStorageProcessor]
> > (agentRequest-Handler-5:null) Succesfully connected to Ceph cluster at
> > ceph-mon.simplercloud.com:6789
> > 2013-10-08 00:02:08,214 DEBUG [kvm.storage.KVMStorageProcessor]
> > (agentRequest-Handler-5:null) Attempting to create RBD snapshot
> > fc5dfa05-2431-4b42-804b-b2fb72e219d0@d6df6e15-d2ec-46a6-8b75-
> 1f94c127d
> > bb8
> > ...
> > 2013-10-08 00:02:20,821 DEBUG [kvm.storage.KVMStorageProcessor]
> > (agentRequest-Handler-1:null) Succesfully connected to Ceph cluster at
> > ceph-mon.simplercloud.com:6789
> > ...
> > 2013-10-08 00:05:19,580 DEBUG [kvm.storage.KVMStorageProcessor]
> > (agentRequest-Handler-3:null) Succesfully connected to Ceph cluster at
> > ceph-mon.simplercloud.com:6789
> > 2013-10-08 00:05:19,610 DEBUG [kvm.storage.KVMStorageProcessor]
> > (agentRequest-Handler-3:null) Attempting to create
> > /mnt/c10404d3-070e-3579-980e-cb0d40effb7b/snapshots/190/2277
> > recursively
> > 2013-10-08 00:05:20,645 DEBUG [kvm.storage.KVMStorageProcessor]
> > (agentRequest-Handler-3:null) Backing up RBD snapshot to
> > /mnt/c10404d3-070e-3579-980e-
> cb0d40effb7b/snapshots/190/2277/5c9ea7c4-
> > 03e1-455e-a785-3c96df68cf69
> > ===
> >
> > After that, depending on the size of the snapshot to be backed up, the
> > KVM host will suffer from high CPU utilisation and will sometimes
> > time out and be disconnected from the management server. Then HA will
> > kick in and everything goes haywire.
> >
> > This is what I have done:
> >
> > (1) Enabling jumbo frames (MTU 9000) on the NIC cards and the switch
> > ports where the KVM hosts and the secondary storage server are
> > connected to. We are using 10 GBps NIC cards and switches.
> >
> > (2) Performed some fine-tuning on the KVM hosts' /etc/sysctl.conf to
> > improve the network performance:
> >
> > net.ipv4.tcp_wmem = 4096 65536 16777216 net.ipv4.tcp_rmem = 4096
> 87380
> > 16777216 net.core.wmem_max = 16777216 net.core.rmem_max =
> 16777216
> > net.core.wmem_default = 65536 net.core.rmem_default = 87380
> > net.core.netdev_max_backlog = 30000
> >
> > (3) Try to disable RBD snapshot copying from primary to secondary
> > storage by setting this global Cloudstack configuration:
> >
> > snapshot.backup.rightafter = false
> >
> > However, when I tested, the snapshotting process will still try to
> > copy the snapshot backup over to the secondary storage. I checked on
> > another thread and it seems to be a bug on 4.2.0.
> >
> > Is it actually a necessity for the snapshots to be backed up to
> > secondary storage, or is it OK for the snapshots to stay in primary
> > storage? It seems that the process of copying over the snapshot from
> > primary to secondary storage is the root cause of the high CPU utilisation
> on the KVM hosts.
> >
> > Due to the bug in point (3) above, is there a workaround to
> > prevent the snapshots from being copied over to the secondary storage
> > after they are created on the primary storage?
> >
> > Looking forward to your reply, thank you.
> >
> > Cheers.
> >
> >
> >
> > On Tue, Oct 8, 2013 at 2:41 PM, Wido den Hollander <wi...@widodh.nl>
> wrote:
> >
> >>
> >>
> >> On 10/08/2013 04:59 AM, Indra Pramana wrote:
> >>
> >>> Dear Wido and all,
> >>>
> >>> I performed some further tests last night:
> >>>
> >>> (1) CPU utilization of the KVM host while RBD snapshot running is
> >>> still shooting up high even after I set global setting:
> >>> concurrent.snapshots.threshold.perhost to 2.
> >>>
> >>> (2) Most of the concurrent snapshot processes will fail with either
> >>> stuck in "Creating" state, or "CreatedOnPrimary" error message.
> >>>
> >>>
> >> Hmm, that is odd. It uses rados-java to call the RBD library to
> >> create the snapshot and afterwards it copies it to Secondary Storage.
> >>
> >> I'm leaving for the Ceph Days and the Build a Cloud Day afterwards in
> >> London now, so I won't be able to look at this the coming 2 days.
> >>
> >>
> >>  (3) I also have adjusted some other related global settings such as
> >>> backup.snapshot.wait and job.expire.minutes, without any luck.
> >>>
> >>> Any advice on what causes the high CPU utilization is
> >>> greatly appreciated.
> >>>
> >>>
> >> You might want to set the Agent log to debug and see if the RBD
> >> snapshot was created, it should log that:
> >> https://git-wip-us.apache.org/repos/asf?p=cloudstack.git;a=blob;f=plugins/hypervisors/kvm/src/com/cloud/hypervisor/kvm/storage/KVMStorageProcessor.java;h=1b883519073acc7514b66857e080a464714c4324;hb=4.2#l1091
> >>
> >> "Attempting to create RBD snapshot"
> >>
> >> If that succeeds the problem lies with backing up the snapshot to
> >> Secondary Storage.
> >>
> >> Wido
> >>
> >>
> >>  Looking forward to your reply, thank you.
> >>>
> >>> Cheers.
> >>>
> >>>
> >>> On Mon, Oct 7, 2013 at 11:03 PM, Indra Pramana <in...@sg.or.id> wrote:
> >>>
> >>>  Dear all,
> >>>>
> >>>> I also found out that when the RBD snapshot is being run, the CPU
> >>>> utilisation on the KVM host will be shooting up very high, which
> >>>> might explain why the host becomes disconnected.
> >>>>
> >>>> top - 22:49:32 up 3 days, 19:31,  1 user,  load average: 7.85,
> >>>> 4.97,
> >>>> 3.47
> >>>> Tasks: 297 total,   3 running, 294 sleeping,   0 stopped,   0 zombie
> >>>> Cpu(s):  4.5%us,  1.2%sy,  0.0%ni, 94.1%id,  0.1%wa,  0.0%hi,
> >>>> 0.0%si, 0.0%st
> >>>> Mem:  264125244k total, 77203460k used, 186921784k free,   154888k
> >>>> buffers
> >>>> Swap:   545788k total,        0k used,   545788k free, 60677092k cached
> >>>>
> >>>>    PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+
> COMMAND
> >>>> 18161 root      20   0 3871m  31m 8444 S  101  0.0 301:58.09 kvm
> >>>>   2790 root      20   0 43.5g 1.6g  19m S   97  0.7  45:52.42 jsvc
> >>>> 24544 root      20   0 4583m  31m 8364 S   97  0.0 425:29.48 kvm
> >>>>   6537 root      20   0     0    0    0 R   71  0.0   0:17.49
> >>>> kworker/3:2
> >>>> 22546 root      20   0 6143m 2.0g 8452 S   26  0.8  55:14.07 kvm
> >>>>   4219 root      20   0 7671m 4.0g 8524 S    6  1.6 106:12.26 kvm
> >>>>   5989 root      20   0 43.2g 1.6g  232 D    6  0.6   0:08.13 jsvc
> >>>>   5993 root      20   0 43.3g 1.6g  224 D    6  0.6   0:08.36 jsvc
> >>>>
> >>>> Is it normal that, when a snapshot is run on a VM on that
> >>>> host, the host's CPU utilisation is higher than usual? How can
> >>>> I limit the CPU resources used by the snapshot?
> >>>>
> >>>>
> >>>> Looking forward to your reply, thank you.
> >>>>
> >>>> Cheers.
> >>>>
> >>>>
> >>>>
> >>>> On Mon, Oct 7, 2013 at 7:18 PM, Indra Pramana <in...@sg.or.id> wrote:
> >>>>
> >>>>  Dear all,
> >>>>>
> >>>>> I did some tests on snapshots since it's now supported for my Ceph
> >>>>> RBD primary storage in CloudStack 4.2. When I ran the snapshot for
> >>>>> a particular VM instance earlier, I noticed that this caused
> >>>>> the host (where the VM runs) to become disconnected.
> >>>>>
> >>>>> Here's the excerpt from the agent.log:
> >>>>>
> >>>>> http://pastebin.com/dxVV7stu
> >>>>>
> >>>>> The management-server.log doesn't show much other than
> >>>>> detecting that the host was down and that HA was being activated:
> >>>>>
> >>>>> http://pastebin.com/UeLiSm9K
> >>>>>
> >>>>> Can anyone advise what is causing the problem? So far only one
> >>>>> user is doing snapshotting and it has already caused issues on
> >>>>> the host; I can't imagine what would happen if multiple users
> >>>>> tried to snapshot at the same time.
> >>>>>
> >>>>> I read about snapshot job throttling which is described on the manual:
> >>>>>
> >>>>>
> >>>>> http://cloudstack.apache.org/docs/en-US/Apache_CloudStack/4.2.0/html/Admin_Guide/working-with-snapshots.html
> >>>>>
> >>>>> But I am not sure whether this will help resolve the problem,
> >>>>> since there is only one user performing snapshots and we already
> >>>>> encounter the problem.
> >>>>>
> >>>>> Can anyone advise how I can troubleshoot further and find a
> >>>>> solution to the problem?
> >>>>>
> >>>>> Looking forward to your reply, thank you.
> >>>>>
> >>>>> Cheers.
> >>>>>
> >>>>>
> >>>>
> >>>>
> >>>
> >

RE: High CPU utilization on KVM hosts while doing RBD snapshot - was Re: snapshot caused host disconnected

Posted by Edison Su <Ed...@citrix.com>.

> -----Original Message-----
> From: Indra Pramana [mailto:indra@sg.or.id]
> Sent: Wednesday, October 09, 2013 11:07 PM
> To: dev@cloudstack.apache.org; users@cloudstack.apache.org
> Subject: Re: High CPU utilization on KVM hosts while doing RBD snapshot -
> was Re: snapshot caused host disconnected
> 
> Dear all,
> 
> I and my colleague tried to scrutinize the source code and found the script
> which is performing the copying of the snapshot from primary storage to
> secondary storage on this file:
> 
> ./plugins/hypervisors/kvm/src/com/cloud/hypervisor/kvm/storage/KVMSto
> rageProcessor.java
> 
> Specifically under this function:
> 
>     @Override
>     public Answer backupSnapshot(CopyCommand cmd) {
> 
> ===
>                     File snapDir = new File(snapshotDestPath);
>                     s_logger.debug("Attempting to create " +
> snapDir.getAbsolutePath() + " recursively");
>                     FileUtils.forceMkdir(snapDir);
> 
>                     File snapFile = new File(snapshotDestPath + "/" + snapshotName);
>                     s_logger.debug("Backing up RBD snapshot to " +
> snapFile.getAbsolutePath());
>                     BufferedOutputStream bos = new BufferedOutputStream(new
> FileOutputStream(snapFile));
>                     int chunkSize = 4194304;
>                     long offset = 0;
>                     while(true) {
>                         byte[] buf = new byte[chunkSize];
> 
>                         int bytes = image.read(offset, buf, chunkSize);
>                         if (bytes <= 0) {
>                             break;
>                         }
>                         bos.write(buf, 0, bytes);
>                         offset += bytes;
>                     }
>                     s_logger.debug("Completed backing up RBD snapshot " +
> snapshotName + " to  " + snapFile.getAbsolutePath() + ". Bytes written: " +
> offset);
>                     bos.close();
> ===
> 
> (1) Is it safe to comment out the above lines and recompile/reinstall, to
> prevent CloudStack from copying the snapshots from the RBD primary
> storage to the secondary storage?
> 
> (2) What would be the impact to CloudStack operations if we leave the
> snapshots on primary storage without copying them to secondary storage?
> Are we still able to do restoration from the snapshots kept in the primary
> storage?

Yes, it's doable, but need to write a SnapshotStrategy in mgt server. The current SnapshotStrategy will always backup snapshot into secondary after taking snapshot.
You can write a new SnapshotStrategy, just take snapshot without backup to secondary storage.
After the commit: 180cfa19e87b909cb1c8a738359e31a6111b11c5 checked into master, you will get a lot of freedom to manipulate snapshot.

> 
> Looking forward to your reply, thank you.
> 
> Cheers.
> 
> 
> 
> On Wed, Oct 9, 2013 at 2:36 PM, Indra Pramana <in...@sg.or.id> wrote:
> 
> > Hi Wido and all,
> >
> > Good day to you, and thank you for your e-mail reply.
> >
> > Yes, from the agent logs I can see that the RBD snapshot was created.
> > However, it seems that CPU utilisation goes up drastically during the
> > copying over of the snapshot from primary storage to the secondary
> > storage.
> >
> > ===
> > 2013-10-08 00:01:58,765 DEBUG [cloud.agent.Agent]
> > (agentRequest-Handler-5:null) Request:Seq 34-898172006:  { Cmd , MgmtId:
> > 161342671900, via: 34, Ver: v1, Flags: 100011,
> > [{"org.apache.cloudstack.storage.command.CreateObjectCommand":{"data":{"org.apache.cloudstack.storage.to.SnapshotObjectTO":{"volume":{"uuid":"0c4f8e41-dfd8-4fc2-a22e-1a79738560a1","volumeType":"DATADISK","dataStore":{"org.apache.cloudstack.storage.to.PrimaryDataStoreTO":{"uuid":"d433809b-01ea-3947-ba0f-48077244e4d6","id":214,"poolType":"RBD","host":"ceph-mon.simplercloud.com","path":"simplercloud-sg-01","port":6789}},"name":"DATA-2051","size":64424509440,"path":"fc5dfa05-2431-4b42-804b-b2fb72e219d0","volumeId":2289,"vmName":"i-195-2051-VM","accountId":195,"format":"RAW","id":2289,"hypervisorType":"KVM"},"parentSnapshotPath":"simplercloud-sg-01/fc5dfa05-2431-4b42-804b-b2fb72e219d0/61042668-23ab-4f63-8a21-ce5a24f9c883","dataStore":{"org.apache.cloudstack.storage.to.PrimaryDataStoreTO":{"uuid":"d433809b-01ea-3947-ba0f-48077244e4d6","id":214,"poolType":"RBD","host":"ceph-mon.simplercloud.com","path":"simplercloud-sg-01","port":6789}},"vmName":"i-195-2051-VM","name":"test-snapshot-and-ip-1_DATA-2051_20131007160158","hypervisorType":"KVM","id":22}},"wait":0}}]
> > }
> > 2013-10-08 00:01:58,765 DEBUG [cloud.agent.Agent]
> > (agentRequest-Handler-5:null) Processing command:
> > org.apache.cloudstack.storage.command.CreateObjectCommand
> > 2013-10-08 00:02:08,071 DEBUG [cloud.agent.Agent]
> > (agentRequest-Handler-1:null) Request:Seq 34-898172007:  { Cmd , MgmtId:
> > 161342671900, via: 34, Ver: v1, Flags: 100011,
> > [{"org.apache.cloudstack.storage.command.CreateObjectCommand":{"data":{"org.apache.cloudstack.storage.to.SnapshotObjectTO":{"volume":{"uuid":"35d9bae0-1683-4a3d-9a69-ccefa18bf899","volumeType":"DATADISK","dataStore":{"org.apache.cloudstack.storage.to.PrimaryDataStoreTO":{"uuid":"d433809b-01ea-3947-ba0f-48077244e4d6","id":214,"poolType":"RBD","host":"ceph-mon.simplercloud.com","path":"simplercloud-sg-01","port":6789}},"name":"DATA-2046","size":42949672960,"path":"59825284-6b60-4a37-b728-755b3752a755","volumeId":2278,"vmName":"i-190-2046-VM","accountId":190,"format":"RAW","id":2278,"hypervisorType":"KVM"},"dataStore":{"org.apache.cloudstack.storage.to.PrimaryDataStoreTO":{"uuid":"d433809b-01ea-3947-ba0f-48077244e4d6","id":214,"poolType":"RBD","host":"ceph-mon.simplercloud.com","path":"simplercloud-sg-01","port":6789}},"vmName":"i-190-2046-VM","name":"test-aft-upgrade-12-win_DATA-2046_20131007160207","hypervisorType":"KVM","id":23}},"wait":0}}]
> > }
> > ...
> > 2013-10-08 00:02:08,191 DEBUG [kvm.storage.KVMStorageProcessor]
> > (agentRequest-Handler-5:null) Succesfully connected to Ceph cluster at
> > ceph-mon.simplercloud.com:6789
> > 2013-10-08 00:02:08,214 DEBUG [kvm.storage.KVMStorageProcessor]
> > (agentRequest-Handler-5:null) Attempting to create RBD snapshot
> > fc5dfa05-2431-4b42-804b-b2fb72e219d0@d6df6e15-d2ec-46a6-8b75-1f94c127dbb8
> > ...
> > 2013-10-08 00:02:20,821 DEBUG [kvm.storage.KVMStorageProcessor]
> > (agentRequest-Handler-1:null) Succesfully connected to Ceph cluster at
> > ceph-mon.simplercloud.com:6789
> > ...
> > 2013-10-08 00:05:19,580 DEBUG [kvm.storage.KVMStorageProcessor]
> > (agentRequest-Handler-3:null) Succesfully connected to Ceph cluster at
> > ceph-mon.simplercloud.com:6789
> > 2013-10-08 00:05:19,610 DEBUG [kvm.storage.KVMStorageProcessor]
> > (agentRequest-Handler-3:null) Attempting to create
> > /mnt/c10404d3-070e-3579-980e-cb0d40effb7b/snapshots/190/2277
> > recursively
> > 2013-10-08 00:05:20,645 DEBUG [kvm.storage.KVMStorageProcessor]
> > (agentRequest-Handler-3:null) Backing up RBD snapshot to
> > /mnt/c10404d3-070e-3579-980e-cb0d40effb7b/snapshots/190/2277/5c9ea7c4-03e1-455e-a785-3c96df68cf69
> > ===
> >
> > After that, depending on the size of the snapshot to be backed up, the
> > KVM host will suffer from high CPU utilisation and will sometimes time
> > out and be disconnected from the management server. Then HA will kick
> > in and everything goes haywire.
> >
> > This is what I have done:
> >
> > (1) Enabled jumbo frames (MTU 9000) on the NIC cards and the switch
> > ports where the KVM hosts and the secondary storage server are
> > connected. We are using 10 Gbps NICs and switches.
> >
> > (2) Performed some fine-tuning on the KVM hosts' /etc/sysctl.conf to
> > improve the network performance:
> >
> > net.ipv4.tcp_wmem = 4096 65536 16777216
> > net.ipv4.tcp_rmem = 4096 87380 16777216
> > net.core.wmem_max = 16777216
> > net.core.rmem_max = 16777216
> > net.core.wmem_default = 65536
> > net.core.rmem_default = 87380
> > net.core.netdev_max_backlog = 30000
> >
> > (3) Try to disable RBD snapshot copying from primary to secondary
> > storage by setting this global Cloudstack configuration:
> >
> > snapshot.backup.rightafter = false
> >
> > However, when I tested, the snapshotting process still tries to copy
> > the snapshot backup over to the secondary storage. I checked another
> > thread and it seems to be a bug in 4.2.0.
> >
> > Is it actually necessary for the snapshots to be backed up to
> > secondary storage, or is it OK for the snapshots to stay in primary
> > storage? It seems that the process of copying the snapshot from
> > primary to secondary storage is the root cause of the high CPU
> > utilisation on the KVM hosts.
> >
> > Due to the above bug in point (3), is there a workaround to prevent
> > the snapshots from being copied over to the secondary storage after
> > they are created on the primary storage?
> >
> > Looking forward to your reply, thank you.
> >
> > Cheers.
> >
> >
> >
> > On Tue, Oct 8, 2013 at 2:41 PM, Wido den Hollander <wi...@widodh.nl> wrote:
> >
> >>
> >>
> >> On 10/08/2013 04:59 AM, Indra Pramana wrote:
> >>
> >>> Dear Wido and all,
> >>>
> >>> I performed some further tests last night:
> >>>
> >>> (1) CPU utilization of the KVM host while an RBD snapshot is running
> >>> still shoots up high even after I set the global setting
> >>> concurrent.snapshots.threshold.perhost to 2.
> >>>
> >>> (2) Most of the concurrent snapshot processes fail, either stuck in
> >>> the "Creating" state or with a "CreatedOnPrimary" error message.
> >>>
> >>>
> >> Hmm, that is odd. It uses rados-java to call the RBD library to
> >> create the snapshot and afterwards it copies it to Secondary Storage.
> >>
> >> I'm leaving for the Ceph Days and the Build a Cloud Day afterwards in
> >> London now, so I won't be able to look at this the coming 2 days.
> >>
> >>
> >>  (3) I have also adjusted some other related global settings, such as
> >>> backup.snapshot.wait and job.expire.minutes, without any luck.
> >>>
> >>> Any advice on what causes the high CPU utilization is greatly
> >>> appreciated.
> >>>
> >>>
> >> You might want to set the Agent log to debug and see if the RBD
> >> snapshot was created, it should log that:
> >> https://git-wip-us.apache.org/repos/asf?p=cloudstack.git;a=blob;f=plugins/hypervisors/kvm/src/com/cloud/hypervisor/kvm/storage/KVMStorageProcessor.java;h=1b883519073acc7514b66857e080a464714c4324;hb=4.2#l1091
> >>
> >> "Attempting to create RBD snapshot"
> >>
> >> If that succeeds the problem lies with backing up the snapshot to
> >> Secondary Storage.
> >>
> >> Wido
> >>
> >>
> >>  Looking forward to your reply, thank you.
> >>>
> >>> Cheers.
> >>>
> >>>
> >>> On Mon, Oct 7, 2013 at 11:03 PM, Indra Pramana <in...@sg.or.id> wrote:
> >>>
> >>>  Dear all,
> >>>>
> >>>> I also found out that when the RBD snapshot is being run, the CPU
> >>>> utilisation on the KVM host will be shooting up very high, which
> >>>> might explain why the host becomes disconnected.
> >>>>
> >>>> top - 22:49:32 up 3 days, 19:31,  1 user,  load average: 7.85,
> >>>> 4.97,
> >>>> 3.47
> >>>> Tasks: 297 total,   3 running, 294 sleeping,   0 stopped,   0 zombie
> >>>> Cpu(s):  4.5%us,  1.2%sy,  0.0%ni, 94.1%id,  0.1%wa,  0.0%hi,
> >>>> 0.0%si, 0.0%st
> >>>> Mem:  264125244k total, 77203460k used, 186921784k free,   154888k
> >>>> buffers
> >>>> Swap:   545788k total,        0k used,   545788k free, 60677092k cached
> >>>>
> >>>>    PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
> >>>> 18161 root      20   0 3871m  31m 8444 S  101  0.0 301:58.09 kvm
> >>>>   2790 root      20   0 43.5g 1.6g  19m S   97  0.7  45:52.42 jsvc
> >>>> 24544 root      20   0 4583m  31m 8364 S   97  0.0 425:29.48 kvm
> >>>>   6537 root      20   0     0    0    0 R   71  0.0   0:17.49
> >>>> kworker/3:2
> >>>> 22546 root      20   0 6143m 2.0g 8452 S   26  0.8  55:14.07 kvm
> >>>>   4219 root      20   0 7671m 4.0g 8524 S    6  1.6 106:12.26 kvm
> >>>>   5989 root      20   0 43.2g 1.6g  232 D    6  0.6   0:08.13 jsvc
> >>>>   5993 root      20   0 43.3g 1.6g  224 D    6  0.6   0:08.36 jsvc
> >>>>
> >>>> Is it normal that, when a snapshot is being run on a VM on that
> >>>> host, the host's CPU utilisation is higher than usual? How can I
> >>>> limit the CPU resources used by the snapshot?
> >>>>
> >>>>
> >>>> Looking forward to your reply, thank you.
> >>>>
> >>>> Cheers.
> >>>>
> >>>>
> >>>>
> >>>> On Mon, Oct 7, 2013 at 7:18 PM, Indra Pramana <in...@sg.or.id> wrote:
> >>>>
> >>>>  Dear all,
> >>>>>
> >>>>> I did some tests on snapshots since they are now supported for my
> >>>>> Ceph RBD primary storage in CloudStack 4.2. When I ran a snapshot
> >>>>> for a particular VM instance earlier, I noticed that this caused
> >>>>> the host (where the VM is running) to become disconnected.
> >>>>>
> >>>>> Here's the excerpt from the agent.log:
> >>>>>
> >>>>> http://pastebin.com/dxVV7stu
> >>>>>
> >>>>> The management-server.log doesn't show much beyond detecting that
> >>>>> the host was down and HA being activated:
> >>>>>
> >>>>> http://pastebin.com/UeLiSm9K
> >>>>>
> >>>>> Can anyone advise what is causing the problem? So far there is
> >>>>> only one user doing snapshots and it has already caused issues on
> >>>>> the host; I can't imagine what would happen if multiple users
> >>>>> tried to snapshot at the same time.
> >>>>>
> >>>>> I read about snapshot job throttling which is described on the manual:
> >>>>>
> >>>>>
> >>>>> http://cloudstack.apache.org/docs/en-US/Apache_CloudStack/4.2.0/html/Admin_Guide/working-with-snapshots.html
> >>>>>
> >>>>> But I am not too sure whether this will help resolve the problem,
> >>>>> since there is only one user performing snapshots and we already
> >>>>> encounter the problem.
> >>>>>
> >>>>> Can anyone advise how I can troubleshoot further and find a
> >>>>> solution to the problem?
> >>>>>
> >>>>> Looking forward to your reply, thank you.
> >>>>>
> >>>>> Cheers.
> >>>>>
> >>>>>
> >>>>
> >>>>
> >>>
> >

Re: High CPU utilization on KVM hosts while doing RBD snapshot - was Re: snapshot caused host disconnected

Posted by Indra Pramana <in...@sg.or.id>.
Dear all,

My colleague and I scrutinized the source code and found the code that
copies the snapshot from primary storage to secondary storage in this
file:

./plugins/hypervisors/kvm/src/com/cloud/hypervisor/kvm/storage/KVMStorageProcessor.java

Specifically under this function:

    @Override
    public Answer backupSnapshot(CopyCommand cmd) {

===
                    File snapDir = new File(snapshotDestPath);
                    s_logger.debug("Attempting to create " +
snapDir.getAbsolutePath() + " recursively");
                    FileUtils.forceMkdir(snapDir);

                    File snapFile = new File(snapshotDestPath + "/" +
snapshotName);
                    s_logger.debug("Backing up RBD snapshot to " +
snapFile.getAbsolutePath());
                    BufferedOutputStream bos = new BufferedOutputStream(new
FileOutputStream(snapFile));
                    int chunkSize = 4194304;
                    long offset = 0;
                    while(true) {
                        byte[] buf = new byte[chunkSize];

                        int bytes = image.read(offset, buf, chunkSize);
                        if (bytes <= 0) {
                            break;
                        }
                        bos.write(buf, 0, bytes);
                        offset += bytes;
                    }
                    s_logger.debug("Completed backing up RBD snapshot " +
snapshotName + " to  " + snapFile.getAbsolutePath() + ". Bytes written: " +
offset);
                    bos.close();
===

(1) Is it safe to comment out the above lines and recompile/reinstall, to
prevent CloudStack from copying the snapshots from the RBD primary storage
to the secondary storage?

(2) What would be the impact to CloudStack operations if we leave the
snapshots on primary storage without copying them to secondary storage? Are
we still able to do restoration from the snapshots kept in the primary
storage?
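As an aside, the copy loop quoted above allocates a fresh 4 MiB byte[] on every iteration, which adds GC pressure on top of the librbd reads. A self-contained sketch of the same chunked-copy pattern with a single reused buffer (a ByteArrayInputStream stands in for the rados-java image here, purely for illustration):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class ChunkedCopy {
    // Copy from in to out in fixed-size chunks, reusing one buffer
    // instead of allocating a new one per iteration.
    static long copy(InputStream in, OutputStream out, int chunkSize) throws IOException {
        byte[] buf = new byte[chunkSize]; // allocated once
        long total = 0;
        while (true) {
            int bytes = in.read(buf, 0, chunkSize);
            if (bytes <= 0) { // -1 signals end of stream
                break;
            }
            out.write(buf, 0, bytes);
            total += bytes;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        byte[] src = new byte[10 * 1024 * 1024]; // 10 MiB dummy "image"
        ByteArrayOutputStream dst = new ByteArrayOutputStream();
        long written = copy(new ByteArrayInputStream(src), dst, 4 * 1024 * 1024);
        System.out.println("Bytes written: " + written);
    }
}
```

This alone won't explain pegged cores, since the real cost is reading the whole image through librbd and writing it over NFS, but it is a cheap improvement to that loop.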

Looking forward to your reply, thank you.

Cheers.



On Wed, Oct 9, 2013 at 2:36 PM, Indra Pramana <in...@sg.or.id> wrote:

> Hi Wido and all,
>
> Good day to you, and thank you for your e-mail reply.
>
> Yes, from the agent logs I can see that the RBD snapshot was created.
> However, it seems that CPU utilisation goes up drastically during the
> copying over of the snapshot from primary storage to the secondary storage.
>
> ===
> 2013-10-08 00:01:58,765 DEBUG [cloud.agent.Agent]
> (agentRequest-Handler-5:null) Request:Seq 34-898172006:  { Cmd , MgmtId:
> 161342671900, via: 34, Ver: v1, Flags: 100011,
> [{"org.apache.cloudstack.storage.command.CreateObjectCommand":{"data":{"org.apache.cloudstack.storage.to.SnapshotObjectTO":{"volume":{"uuid":"0c4f8e41-dfd8-4fc2-a22e-1a79738560a1","volumeType":"DATADISK","dataStore":{"org.apache.cloudstack.storage.to.PrimaryDataStoreTO":{"uuid":"d433809b-01ea-3947-ba0f-48077244e4d6","id":214,"poolType":"RBD","host":"
> ceph-mon.simplercloud.com
> ","path":"simplercloud-sg-01","port":6789}},"name":"DATA-2051","size":64424509440,"path":"fc5dfa05-2431-4b42-804b-b2fb72e219d0","volumeId":2289,"vmName":"i-195-2051-VM","accountId":195,"format":"RAW","id":2289,"hypervisorType":"KVM"},"parentSnapshotPath":"simplercloud-sg-01/fc5dfa05-2431-4b42-804b-b2fb72e219d0/61042668-23ab-4f63-8a21-ce5a24f9c883","dataStore":{"org.apache.cloudstack.storage.to.PrimaryDataStoreTO":{"uuid":"d433809b-01ea-3947-ba0f-48077244e4d6","id":214,"poolType":"RBD","host":"
> ceph-mon.simplercloud.com","path":"simplercloud-sg-01","port":6789}},"vmName":"i-195-2051-VM","name":"test-snapshot-and-ip-1_DATA-2051_20131007160158","hypervisorType":"KVM","id":22}},"wait":0}}]
> }
> 2013-10-08 00:01:58,765 DEBUG [cloud.agent.Agent]
> (agentRequest-Handler-5:null) Processing command:
> org.apache.cloudstack.storage.command.CreateObjectCommand
> 2013-10-08 00:02:08,071 DEBUG [cloud.agent.Agent]
> (agentRequest-Handler-1:null) Request:Seq 34-898172007:  { Cmd , MgmtId:
> 161342671900, via: 34, Ver: v1, Flags: 100011,
> [{"org.apache.cloudstack.storage.command.CreateObjectCommand":{"data":{"org.apache.cloudstack.storage.to.SnapshotObjectTO":{"volume":{"uuid":"35d9bae0-1683-4a3d-9a69-ccefa18bf899","volumeType":"DATADISK","dataStore":{"org.apache.cloudstack.storage.to.PrimaryDataStoreTO":{"uuid":"d433809b-01ea-3947-ba0f-48077244e4d6","id":214,"poolType":"RBD","host":"
> ceph-mon.simplercloud.com
> ","path":"simplercloud-sg-01","port":6789}},"name":"DATA-2046","size":42949672960,"path":"59825284-6b60-4a37-b728-755b3752a755","volumeId":2278,"vmName":"i-190-2046-VM","accountId":190,"format":"RAW","id":2278,"hypervisorType":"KVM"},"dataStore":{"org.apache.cloudstack.storage.to.PrimaryDataStoreTO":{"uuid":"d433809b-01ea-3947-ba0f-48077244e4d6","id":214,"poolType":"RBD","host":"
> ceph-mon.simplercloud.com","path":"simplercloud-sg-01","port":6789}},"vmName":"i-190-2046-VM","name":"test-aft-upgrade-12-win_DATA-2046_20131007160207","hypervisorType":"KVM","id":23}},"wait":0}}]
> }
> ...
> 2013-10-08 00:02:08,191 DEBUG [kvm.storage.KVMStorageProcessor]
> (agentRequest-Handler-5:null) Succesfully connected to Ceph cluster at
> ceph-mon.simplercloud.com:6789
> 2013-10-08 00:02:08,214 DEBUG [kvm.storage.KVMStorageProcessor]
> (agentRequest-Handler-5:null) Attempting to create RBD snapshot
> fc5dfa05-2431-4b42-804b-b2fb72e219d0@d6df6e15-d2ec-46a6-8b75-1f94c127dbb8
> ...
> 2013-10-08 00:02:20,821 DEBUG [kvm.storage.KVMStorageProcessor]
> (agentRequest-Handler-1:null) Succesfully connected to Ceph cluster at
> ceph-mon.simplercloud.com:6789
> ...
> 2013-10-08 00:05:19,580 DEBUG [kvm.storage.KVMStorageProcessor]
> (agentRequest-Handler-3:null) Succesfully connected to Ceph cluster at
> ceph-mon.simplercloud.com:6789
> 2013-10-08 00:05:19,610 DEBUG [kvm.storage.KVMStorageProcessor]
> (agentRequest-Handler-3:null) Attempting to create
> /mnt/c10404d3-070e-3579-980e-cb0d40effb7b/snapshots/190/2277 recursively
> 2013-10-08 00:05:20,645 DEBUG [kvm.storage.KVMStorageProcessor]
> (agentRequest-Handler-3:null) Backing up RBD snapshot to
> /mnt/c10404d3-070e-3579-980e-cb0d40effb7b/snapshots/190/2277/5c9ea7c4-03e1-455e-a785-3c96df68cf69
> ===
>
> After that, depending on the size of the snapshot to be backed up, the KVM
> host will suffer from high CPU utilisation and will sometimes time out and
> be disconnected from the management server. Then HA will kick in and
> everything goes haywire.
>
> This is what I have done:
>
> (1) Enabled jumbo frames (MTU 9000) on the NIC cards and the switch ports
> where the KVM hosts and the secondary storage server are connected. We
> are using 10 Gbps NICs and switches.
>
> (2) Performed some fine-tuning on the KVM hosts' /etc/sysctl.conf to
> improve the network performance:
>
> net.ipv4.tcp_wmem = 4096 65536 16777216
> net.ipv4.tcp_rmem = 4096 87380 16777216
> net.core.wmem_max = 16777216
> net.core.rmem_max = 16777216
> net.core.wmem_default = 65536
> net.core.rmem_default = 87380
> net.core.netdev_max_backlog = 30000
>
> (3) Try to disable RBD snapshot copying from primary to secondary storage
> by setting this global Cloudstack configuration:
>
> snapshot.backup.rightafter = false
>
> However, when I tested, the snapshotting process still tries to copy the
> snapshot backup over to the secondary storage. I checked another thread
> and it seems to be a bug in 4.2.0.
>
> Is it actually necessary for the snapshots to be backed up to secondary
> storage, or is it OK for the snapshots to stay in primary storage? It seems
> that the process of copying the snapshot from primary to secondary storage
> is the root cause of the high CPU utilisation on the KVM hosts.
>
> Due to the above bug in point (3), is there a workaround to prevent the
> snapshots from being copied over to the secondary storage after they are
> created on the primary storage?
>
> Looking forward to your reply, thank you.
>
> Cheers.
>
>
>
> On Tue, Oct 8, 2013 at 2:41 PM, Wido den Hollander <wi...@widodh.nl> wrote:
>
>>
>>
>> On 10/08/2013 04:59 AM, Indra Pramana wrote:
>>
>>> Dear Wido and all,
>>>
>>> I performed some further tests last night:
>>>
>>> (1) CPU utilization of the KVM host while an RBD snapshot is running
>>> still shoots up high even after I set the global setting
>>> concurrent.snapshots.threshold.perhost to 2.
>>>
>>> (2) Most of the concurrent snapshot processes fail, either stuck in the
>>> "Creating" state or with a "CreatedOnPrimary" error message.
>>>
>>>
>> Hmm, that is odd. It uses rados-java to call the RBD library to create
>> the snapshot and afterwards it copies it to Secondary Storage.
>>
>> I'm leaving for the Ceph Days and the Build a Cloud Day afterwards in
>> London now, so I won't be able to look at this the coming 2 days.
>>
>>
>>  (3) I have also adjusted some other related global settings, such as
>>> backup.snapshot.wait and job.expire.minutes, without any luck.
>>>
>>> Any advice on what causes the high CPU utilization is greatly
>>> appreciated.
>>>
>>>
>> You might want to set the Agent log to debug and see if the RBD snapshot
>> was created, it should log that:
>> https://git-wip-us.apache.org/repos/asf?p=cloudstack.git;a=blob;f=plugins/hypervisors/kvm/src/com/cloud/hypervisor/kvm/storage/KVMStorageProcessor.java;h=1b883519073acc7514b66857e080a464714c4324;hb=4.2#l1091
>>
>> "Attempting to create RBD snapshot"
>>
>> If that succeeds the problem lies with backing up the snapshot to
>> Secondary Storage.
>>
>> Wido
>>
>>
>>  Looking forward to your reply, thank you.
>>>
>>> Cheers.
>>>
>>>
>>> On Mon, Oct 7, 2013 at 11:03 PM, Indra Pramana <in...@sg.or.id> wrote:
>>>
>>>  Dear all,
>>>>
>>>> I also found out that when the RBD snapshot is being run, the CPU
>>>> utilisation on the KVM host will be shooting up very high, which might
>>>> explain why the host becomes disconnected.
>>>>
>>>> top - 22:49:32 up 3 days, 19:31,  1 user,  load average: 7.85, 4.97,
>>>> 3.47
>>>> Tasks: 297 total,   3 running, 294 sleeping,   0 stopped,   0 zombie
>>>> Cpu(s):  4.5%us,  1.2%sy,  0.0%ni, 94.1%id,  0.1%wa,  0.0%hi,  0.0%si,
>>>> 0.0%st
>>>> Mem:  264125244k total, 77203460k used, 186921784k free,   154888k
>>>> buffers
>>>> Swap:   545788k total,        0k used,   545788k free, 60677092k cached
>>>>
>>>>    PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>>>> 18161 root      20   0 3871m  31m 8444 S  101  0.0 301:58.09 kvm
>>>>   2790 root      20   0 43.5g 1.6g  19m S   97  0.7  45:52.42 jsvc
>>>> 24544 root      20   0 4583m  31m 8364 S   97  0.0 425:29.48 kvm
>>>>   6537 root      20   0     0    0    0 R   71  0.0   0:17.49
>>>> kworker/3:2
>>>> 22546 root      20   0 6143m 2.0g 8452 S   26  0.8  55:14.07 kvm
>>>>   4219 root      20   0 7671m 4.0g 8524 S    6  1.6 106:12.26 kvm
>>>>   5989 root      20   0 43.2g 1.6g  232 D    6  0.6   0:08.13 jsvc
>>>>   5993 root      20   0 43.3g 1.6g  224 D    6  0.6   0:08.36 jsvc
>>>>
>>>> Is it normal that, when a snapshot is being run on a VM on that host,
>>>> the host's CPU utilisation is higher than usual? How can I limit the
>>>> CPU resources used by the snapshot?
>>>>
>>>>
>>>> Looking forward to your reply, thank you.
>>>>
>>>> Cheers.
>>>>
>>>>
>>>>
>>>> On Mon, Oct 7, 2013 at 7:18 PM, Indra Pramana <in...@sg.or.id> wrote:
>>>>
>>>>  Dear all,
>>>>>
>>>>> I did some tests on snapshots since they are now supported for my Ceph
>>>>> RBD primary storage in CloudStack 4.2. When I ran a snapshot for a
>>>>> particular VM instance earlier, I noticed that this caused the host
>>>>> (where the VM is running) to become disconnected.
>>>>>
>>>>> Here's the excerpt from the agent.log:
>>>>>
>>>>> http://pastebin.com/dxVV7stu
>>>>>
>>>>> The management-server.log doesn't show much beyond detecting that the
>>>>> host was down and HA being activated:
>>>>>
>>>>> http://pastebin.com/UeLiSm9K
>>>>>
>>>>> Can anyone advise what is causing the problem? So far there is only one
>>>>> user doing snapshots and it has already caused issues on the host; I
>>>>> can't imagine what would happen if multiple users tried to snapshot at
>>>>> the same time.
>>>>>
>>>>> I read about snapshot job throttling which is described on the manual:
>>>>>
>>>>>
>>>>> http://cloudstack.apache.org/docs/en-US/Apache_CloudStack/4.2.0/html/Admin_Guide/working-with-snapshots.html
>>>>>
>>>>> But I am not too sure whether this will help resolve the problem, since
>>>>> there is only one user performing snapshots and we already encounter
>>>>> the problem.
>>>>>
>>>>> Can anyone advise how I can troubleshoot further and find a solution to
>>>>> the problem?
>>>>>
>>>>> Looking forward to your reply, thank you.
>>>>>
>>>>> Cheers.
>>>>>
>>>>>
>>>>
>>>>
>>>
>

Re: High CPU utilization on KVM hosts while doing RBD snapshot - was Re: snapshot caused host disconnected

Posted by Indra Pramana <in...@sg.or.id>.
Dear all,

I and my colleague tried to scrutinize the source code and found the script
which is performing the copying of the snapshot from primary storage to
secondary storage on this file:

./plugins/hypervisors/kvm/src/com/cloud/hypervisor/kvm/storage/KVMStorageProcessor.java

Specifically under this function:

    @Override
    public Answer backupSnapshot(CopyCommand cmd) {

===
                    File snapDir = new File(snapshotDestPath);
                    s_logger.debug("Attempting to create " +
snapDir.getAbsolutePath() + " recursively");
                    FileUtils.forceMkdir(snapDir);

                    File snapFile = new File(snapshotDestPath + "/" +
snapshotName);
                    s_logger.debug("Backing up RBD snapshot to " +
snapFile.getAbsolutePath());
                    BufferedOutputStream bos = new BufferedOutputStream(new
FileOutputStream(snapFile));
                    int chunkSize = 4194304;
                    long offset = 0;
                    while(true) {
                        byte[] buf = new byte[chunkSize];

                        int bytes = image.read(offset, buf, chunkSize);
                        if (bytes <= 0) {
                            break;
                        }
                        bos.write(buf, 0, bytes);
                        offset += bytes;
                    }
                    s_logger.debug("Completed backing up RBD snapshot " +
snapshotName + " to  " + snapFile.getAbsolutePath() + ". Bytes written: " +
offset);
                    bos.close();
===

(1) Is it safe to comment out the above lines and recompile/reinstall, to
prevent CloudStack from copying the snapshots from the RBD primary storage
to the secondary storage?

(2) What would be the impact to CloudStack operations if we leave the
snapshots on primary storage without copying them to secondary storage? Are
we still able to do restoration from the snapshots kept in the primary
storage?

Looking forward to your reply, thank you.

Cheers.



On Wed, Oct 9, 2013 at 2:36 PM, Indra Pramana <in...@sg.or.id> wrote:

> Hi Wido and all,
>
> Good day to you, and thank you for your e-mail reply.
>
> Yes, from the agent logs I can see that the RBD snapshot was created.
> However, it seems that CPU utilisation goes up drastically during the
> copying over of the snapshot from primary storage to the secondary storage.
>
> ===
> 2013-10-08 00:01:58,765 DEBUG [cloud.agent.Agent]
> (agentRequest-Handler-5:null) Request:Seq 34-898172006:  { Cmd , MgmtId:
> 161342671900, via: 34, Ver: v1, Flags: 100011,
> [{"org.apache.cloudstack.storage.command.CreateObjectCommand":{"data":{"org.apache.cloudstack.storage.to.SnapshotObjectTO":{"volume":{"uuid":"0c4f8e41-dfd8-4fc2-a22e-1a79738560a1","volumeType":"DATADISK","dataStore":{"org.apache.cloudstack.storage.to.PrimaryDataStoreTO":{"uuid":"d433809b-01ea-3947-ba0f-48077244e4d6","id":214,"poolType":"RBD","host":"
> ceph-mon.simplercloud.com
> ","path":"simplercloud-sg-01","port":6789}},"name":"DATA-2051","size":64424509440,"path":"fc5dfa05-2431-4b42-804b-b2fb72e219d0","volumeId":2289,"vmName":"i-195-2051-VM","accountId":195,"format":"RAW","id":2289,"hypervisorType":"KVM"},"parentSnapshotPath":"simplercloud-sg-01/fc5dfa05-2431-4b42-804b-b2fb72e219d0/61042668-23ab-4f63-8a21-ce5a24f9c883","dataStore":{"org.apache.cloudstack.storage.to.PrimaryDataStoreTO":{"uuid":"d433809b-01ea-3947-ba0f-48077244e4d6","id":214,"poolType":"RBD","host":"
> ceph-mon.simplercloud.com","path":"simplercloud-sg-01","port":6789}},"vmName":"i-195-2051-VM","name":"test-snapshot-and-ip-1_DATA-2051_20131007160158","hypervisorType":"KVM","id":22}},"wait":0}}]
> }
> 2013-10-08 00:01:58,765 DEBUG [cloud.agent.Agent]
> (agentRequest-Handler-5:null) Processing command:
> org.apache.cloudstack.storage.command.CreateObjectCommand
> 2013-10-08 00:02:08,071 DEBUG [cloud.agent.Agent]
> (agentRequest-Handler-1:null) Request:Seq 34-898172007:  { Cmd , MgmtId:
> 161342671900, via: 34, Ver: v1, Flags: 100011,
> [{"org.apache.cloudstack.storage.command.CreateObjectCommand":{"data":{"org.apache.cloudstack.storage.to.SnapshotObjectTO":{"volume":{"uuid":"35d9bae0-1683-4a3d-9a69-ccefa18bf899","volumeType":"DATADISK","dataStore":{"org.apache.cloudstack.storage.to.PrimaryDataStoreTO":{"uuid":"d433809b-01ea-3947-ba0f-48077244e4d6","id":214,"poolType":"RBD","host":"
> ceph-mon.simplercloud.com
> ","path":"simplercloud-sg-01","port":6789}},"name":"DATA-2046","size":42949672960,"path":"59825284-6b60-4a37-b728-755b3752a755","volumeId":2278,"vmName":"i-190-2046-VM","accountId":190,"format":"RAW","id":2278,"hypervisorType":"KVM"},"dataStore":{"org.apache.cloudstack.storage.to.PrimaryDataStoreTO":{"uuid":"d433809b-01ea-3947-ba0f-48077244e4d6","id":214,"poolType":"RBD","host":"
> ceph-mon.simplercloud.com","path":"simplercloud-sg-01","port":6789}},"vmName":"i-190-2046-VM","name":"test-aft-upgrade-12-win_DATA-2046_20131007160207","hypervisorType":"KVM","id":23}},"wait":0}}]
> }
> ...
> 2013-10-08 00:02:08,191 DEBUG [kvm.storage.KVMStorageProcessor]
> (agentRequest-Handler-5:null) Succesfully connected to Ceph cluster at
> ceph-mon.simplercloud.com:6789
> 2013-10-08 00:02:08,214 DEBUG [kvm.storage.KVMStorageProcessor]
> (agentRequest-Handler-5:null) Attempting to create RBD snapshot
> fc5dfa05-2431-4b42-804b-b2fb72e219d0@d6df6e15-d2ec-46a6-8b75-1f94c127dbb8
> ...
> 2013-10-08 00:02:20,821 DEBUG [kvm.storage.KVMStorageProcessor]
> (agentRequest-Handler-1:null) Succesfully connected to Ceph cluster at
> ceph-mon.simplercloud.com:6789
> ...
> 2013-10-08 00:05:19,580 DEBUG [kvm.storage.KVMStorageProcessor]
> (agentRequest-Handler-3:null) Succesfully connected to Ceph cluster at
> ceph-mon.simplercloud.com:6789
> 2013-10-08 00:05:19,610 DEBUG [kvm.storage.KVMStorageProcessor]
> (agentRequest-Handler-3:null) Attempting to create
> /mnt/c10404d3-070e-3579-980e-cb0d40effb7b/snapshots/190/2277 recursively
> 2013-10-08 00:05:20,645 DEBUG [kvm.storage.KVMStorageProcessor]
> (agentRequest-Handler-3:null) Backing up RBD snapshot to
> /mnt/c10404d3-070e-3579-980e-cb0d40effb7b/snapshots/190/2277/5c9ea7c4-03e1-455e-a785-3c96df68cf69
> ===
>
> After which, depends on the size of the snapshot to be backed up, the KVM
> host will suffer from high CPU utilisation and will sometimes time-out and
> disconnected from the management server. Then HA will kick in and
> everything goes haywire.
>
> This is what I have done:
>
> (1) Enabling jumbo frames (MTU 9000) on the NIC cards and the switch ports
> where the KVM hosts and the secondary storage server are connected to. We
> are using 10 GBps NIC cards and switches.
>
> (2) Performed some fine-tuning on the KVM hosts' /etc/sysctl.conf to
> improve the network performance:
>
> net.ipv4.tcp_wmem = 4096 65536 16777216
> net.ipv4.tcp_rmem = 4096 87380 16777216
> net.core.wmem_max = 16777216
> net.core.rmem_max = 16777216
> net.core.wmem_default = 65536
> net.core.rmem_default = 87380
> net.core.netdev_max_backlog = 30000
>
> (3) Tried to disable RBD snapshot copying from primary to secondary storage
> by setting this global CloudStack configuration:
>
> snapshot.backup.rightafter = false
>
> However, when I tested, the snapshotting process would still copy
> the snapshot backup over to the secondary storage. I checked on another
> thread and it seems to be a bug in 4.2.0.
>
> Is it actually necessary for the snapshots to be backed up to secondary
> storage, or is it OK for the snapshots to stay on primary storage? It seems
> that the process of copying the snapshot from primary to secondary
> storage is the root cause of the high CPU utilisation on the KVM hosts.
>
> Due to the above bug in point (3), is there a workaround to
> prevent the snapshots from being copied over to the secondary storage after
> they are created on the primary storage?
>
> Looking forward to your reply, thank you.
>
> Cheers.
>
>
>
> On Tue, Oct 8, 2013 at 2:41 PM, Wido den Hollander <wi...@widodh.nl> wrote:
>
>>
>>
>> On 10/08/2013 04:59 AM, Indra Pramana wrote:
>>
>>> Dear Wido and all,
>>>
>>> I performed some further tests last night:
>>>
>>> (1) CPU utilization of the KVM host while an RBD snapshot is running
>>> still shoots up high even after I set the global setting
>>> concurrent.snapshots.threshold.perhost to 2.
>>>
>>> (2) Most of the concurrent snapshot processes will fail, either stuck
>>> in the "Creating" state or with a "CreatedOnPrimary" error message.
>>>
>>>
>> Hmm, that is odd. It uses rados-java to call the RBD library to create
>> the snapshot and afterwards it copies it to Secondary Storage.
>>
>> I'm leaving for the Ceph Days and the Build a Cloud Day afterwards in
>> London now, so I won't be able to look at this the coming 2 days.
>>
>>
>>> (3) I also have adjusted some other related global settings, such as
>>> backup.snapshot.wait and job.expire.minutes, without any luck.
>>>
>>> Any advice on what causes the high CPU utilization is greatly
>>> appreciated.
>>>
>>>
>> You might want to set the Agent log to debug and see if the RBD snapshot
>> was created; it should log that:
>> https://git-wip-us.apache.org/repos/asf?p=cloudstack.git;a=blob;f=plugins/hypervisors/kvm/src/com/cloud/hypervisor/kvm/storage/KVMStorageProcessor.java;h=1b883519073acc7514b66857e080a464714c4324;hb=4.2#l1091
>>
>> "Attempting to create RBD snapshot"
>>
>> If that succeeds the problem lies with backing up the snapshot to
>> Secondary Storage.
>>
>> Wido
>>
>>
>>  Looking forward to your reply, thank you.
>>>
>>> Cheers.
>>>
>>>
>>> On Mon, Oct 7, 2013 at 11:03 PM, Indra Pramana <in...@sg.or.id> wrote:
>>>
>>>  Dear all,
>>>>
>>>> I also found out that when the RBD snapshot is being run, the CPU
>>>> utilisation on the KVM host will be shooting up very high, which might
>>>> explain why the host becomes disconnected.
>>>>
>>>> top - 22:49:32 up 3 days, 19:31,  1 user,  load average: 7.85, 4.97,
>>>> 3.47
>>>> Tasks: 297 total,   3 running, 294 sleeping,   0 stopped,   0 zombie
>>>> Cpu(s):  4.5%us,  1.2%sy,  0.0%ni, 94.1%id,  0.1%wa,  0.0%hi,  0.0%si,
>>>> 0.0%st
>>>> Mem:  264125244k total, 77203460k used, 186921784k free,   154888k
>>>> buffers
>>>> Swap:   545788k total,        0k used,   545788k free, 60677092k cached
>>>>
>>>>    PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>>>> 18161 root      20   0 3871m  31m 8444 S  101  0.0 301:58.09 kvm
>>>>   2790 root      20   0 43.5g 1.6g  19m S   97  0.7  45:52.42 jsvc
>>>> 24544 root      20   0 4583m  31m 8364 S   97  0.0 425:29.48 kvm
>>>>   6537 root      20   0     0    0    0 R   71  0.0   0:17.49
>>>> kworker/3:2
>>>> 22546 root      20   0 6143m 2.0g 8452 S   26  0.8  55:14.07 kvm
>>>>   4219 root      20   0 7671m 4.0g 8524 S    6  1.6 106:12.26 kvm
>>>>   5989 root      20   0 43.2g 1.6g  232 D    6  0.6   0:08.13 jsvc
>>>>   5993 root      20   0 43.3g 1.6g  224 D    6  0.6   0:08.36 jsvc
>>>>
>>>> Is it normal that when a snapshot is run on a VM on that host,
>>>> the host's CPU utilisation is higher than usual? How can I limit the
>>>> CPU resources used by the snapshot?
>>>>
>>>>
>>>> Looking forward to your reply, thank you.
>>>>
>>>> Cheers.
>>>>
>>>>
>>>>
>>>> On Mon, Oct 7, 2013 at 7:18 PM, Indra Pramana <in...@sg.or.id> wrote:
>>>>
>>>>  Dear all,
>>>>>
>>>>> I did some tests on snapshots since it's now supported for my Ceph RBD
>>>>> primary storage in CloudStack 4.2. When I ran the snapshot for a
>>>>> particular VM instance earlier, I noticed that this caused the host
>>>>> (where the VM is running) to become disconnected.
>>>>>
>>>>> Here's the excerpt from the agent.log:
>>>>>
>>>>> http://pastebin.com/dxVV7stu
>>>>>
>>>>> The management-server.log doesn't show much other than
>>>>> detecting that the host was down and HA is being activated:
>>>>>
>>>>> http://pastebin.com/UeLiSm9K
>>>>>
>>>>> Can anyone advise what is causing the problem? So far only one user
>>>>> has been doing the snapshotting and it has already caused issues to the
>>>>> host; I can't imagine what would happen if multiple users tried to do
>>>>> snapshotting at the same time.
>>>>>
>>>>> I read about snapshot job throttling which is described on the manual:
>>>>>
>>>>>
>>>>> http://cloudstack.apache.org/docs/en-US/Apache_CloudStack/4.2.0/html/Admin_Guide/working-with-snapshots.html
>>>>>
>>>>> But I am not too sure whether this will help to resolve the problem,
>>>>> since there is only one user performing snapshots and we already
>>>>> encounter the problem.
>>>>>
>>>>> Can anyone advise how I can troubleshoot further and find a solution to
>>>>> the problem?
>>>>>
>>>>> Looking forward to your reply, thank you.
>>>>>
>>>>> Cheers.
>>>>>
>>>>>
>>>>
>>>>
>>>
>

Re: High CPU utilization on KVM hosts while doing RBD snapshot - was Re: snapshot caused host disconnected

Posted by Indra Pramana <in...@sg.or.id>.
Hi Wido and all,

Good day to you, and thank you for your e-mail reply.

Yes, from the agent logs I can see that the RBD snapshot was created.
However, it seems that CPU utilisation goes up drastically during the
copying over of the snapshot from primary storage to the secondary storage.

===
2013-10-08 00:01:58,765 DEBUG [cloud.agent.Agent]
(agentRequest-Handler-5:null) Request:Seq 34-898172006:  { Cmd , MgmtId:
161342671900, via: 34, Ver: v1, Flags: 100011,
[{"org.apache.cloudstack.storage.command.CreateObjectCommand":{"data":{"org.apache.cloudstack.storage.to.SnapshotObjectTO":{"volume":{"uuid":"0c4f8e41-dfd8-4fc2-a22e-1a79738560a1","volumeType":"DATADISK","dataStore":{"org.apache.cloudstack.storage.to.PrimaryDataStoreTO":{"uuid":"d433809b-01ea-3947-ba0f-48077244e4d6","id":214,"poolType":"RBD","host":"
ceph-mon.simplercloud.com
","path":"simplercloud-sg-01","port":6789}},"name":"DATA-2051","size":64424509440,"path":"fc5dfa05-2431-4b42-804b-b2fb72e219d0","volumeId":2289,"vmName":"i-195-2051-VM","accountId":195,"format":"RAW","id":2289,"hypervisorType":"KVM"},"parentSnapshotPath":"simplercloud-sg-01/fc5dfa05-2431-4b42-804b-b2fb72e219d0/61042668-23ab-4f63-8a21-ce5a24f9c883","dataStore":{"org.apache.cloudstack.storage.to.PrimaryDataStoreTO":{"uuid":"d433809b-01ea-3947-ba0f-48077244e4d6","id":214,"poolType":"RBD","host":"
ceph-mon.simplercloud.com","path":"simplercloud-sg-01","port":6789}},"vmName":"i-195-2051-VM","name":"test-snapshot-and-ip-1_DATA-2051_20131007160158","hypervisorType":"KVM","id":22}},"wait":0}}]
}
2013-10-08 00:01:58,765 DEBUG [cloud.agent.Agent]
(agentRequest-Handler-5:null) Processing command:
org.apache.cloudstack.storage.command.CreateObjectCommand
2013-10-08 00:02:08,071 DEBUG [cloud.agent.Agent]
(agentRequest-Handler-1:null) Request:Seq 34-898172007:  { Cmd , MgmtId:
161342671900, via: 34, Ver: v1, Flags: 100011,
[{"org.apache.cloudstack.storage.command.CreateObjectCommand":{"data":{"org.apache.cloudstack.storage.to.SnapshotObjectTO":{"volume":{"uuid":"35d9bae0-1683-4a3d-9a69-ccefa18bf899","volumeType":"DATADISK","dataStore":{"org.apache.cloudstack.storage.to.PrimaryDataStoreTO":{"uuid":"d433809b-01ea-3947-ba0f-48077244e4d6","id":214,"poolType":"RBD","host":"
ceph-mon.simplercloud.com
","path":"simplercloud-sg-01","port":6789}},"name":"DATA-2046","size":42949672960,"path":"59825284-6b60-4a37-b728-755b3752a755","volumeId":2278,"vmName":"i-190-2046-VM","accountId":190,"format":"RAW","id":2278,"hypervisorType":"KVM"},"dataStore":{"org.apache.cloudstack.storage.to.PrimaryDataStoreTO":{"uuid":"d433809b-01ea-3947-ba0f-48077244e4d6","id":214,"poolType":"RBD","host":"
ceph-mon.simplercloud.com","path":"simplercloud-sg-01","port":6789}},"vmName":"i-190-2046-VM","name":"test-aft-upgrade-12-win_DATA-2046_20131007160207","hypervisorType":"KVM","id":23}},"wait":0}}]
}
...
2013-10-08 00:02:08,191 DEBUG [kvm.storage.KVMStorageProcessor]
(agentRequest-Handler-5:null) Succesfully connected to Ceph cluster at
ceph-mon.simplercloud.com:6789
2013-10-08 00:02:08,214 DEBUG [kvm.storage.KVMStorageProcessor]
(agentRequest-Handler-5:null) Attempting to create RBD snapshot
fc5dfa05-2431-4b42-804b-b2fb72e219d0@d6df6e15-d2ec-46a6-8b75-1f94c127dbb8
...
2013-10-08 00:02:20,821 DEBUG [kvm.storage.KVMStorageProcessor]
(agentRequest-Handler-1:null) Succesfully connected to Ceph cluster at
ceph-mon.simplercloud.com:6789
...
2013-10-08 00:05:19,580 DEBUG [kvm.storage.KVMStorageProcessor]
(agentRequest-Handler-3:null) Succesfully connected to Ceph cluster at
ceph-mon.simplercloud.com:6789
2013-10-08 00:05:19,610 DEBUG [kvm.storage.KVMStorageProcessor]
(agentRequest-Handler-3:null) Attempting to create
/mnt/c10404d3-070e-3579-980e-cb0d40effb7b/snapshots/190/2277 recursively
2013-10-08 00:05:20,645 DEBUG [kvm.storage.KVMStorageProcessor]
(agentRequest-Handler-3:null) Backing up RBD snapshot to
/mnt/c10404d3-070e-3579-980e-cb0d40effb7b/snapshots/190/2277/5c9ea7c4-03e1-455e-a785-3c96df68cf69
===

After that, depending on the size of the snapshot to be backed up, the KVM
host will suffer from high CPU utilisation and will sometimes time out and
get disconnected from the management server. Then HA will kick in and
everything goes haywire.

This is what I have done:

(1) Enabled jumbo frames (MTU 9000) on the NIC cards and the switch ports
where the KVM hosts and the secondary storage server are connected. We
are using 10 Gbps NIC cards and switches.

(2) Performed some fine-tuning on the KVM hosts' /etc/sysctl.conf to
improve the network performance:

net.ipv4.tcp_wmem = 4096 65536 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.core.wmem_max = 16777216
net.core.rmem_max = 16777216
net.core.wmem_default = 65536
net.core.rmem_default = 87380
net.core.netdev_max_backlog = 30000
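
A quick way to check whether the running kernel actually picked up this tuning is to read the keys back via /proc/sys (a minimal sketch, assuming a Linux host; no root is needed for reading, and `sysctl -p /etc/sysctl.conf` as root applies the file):

```shell
# Print the effective values of a few of the tuned keys via /proc/sys.
# Keys not present on this system are reported rather than failing.
for key in net.core.wmem_max net.core.rmem_max net.core.netdev_max_backlog; do
    path="/proc/sys/$(echo "$key" | tr . /)"
    if [ -r "$path" ]; then
        printf '%s = %s\n' "$key" "$(cat "$path")"
    else
        printf '%s = (not readable on this system)\n' "$key"
    fi
done
```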

(3) Tried to disable RBD snapshot copying from primary to secondary storage
by setting this global CloudStack configuration:

snapshot.backup.rightafter = false

However, when I tested, the snapshotting process would still copy the
snapshot backup over to the secondary storage. I checked on another thread
and it seems to be a bug in 4.2.0.

Is it actually necessary for the snapshots to be backed up to secondary
storage, or is it OK for the snapshots to stay on primary storage? It seems
that the process of copying the snapshot from primary to secondary
storage is the root cause of the high CPU utilisation on the KVM hosts.

Due to the above bug in point (3), is there a workaround to
prevent the snapshots from being copied over to the secondary storage after
they are created on the primary storage?
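
Until the snapshot.backup.rightafter behaviour is fixed, snapshots that land on primary storage can at least be inspected and pruned directly with the rbd CLI. A hedged sketch only: the pool and image names below are copied from the log excerpt above, the snapshot-name placeholder is hypothetical, and the commands require the rbd client and cluster access:

```shell
# List the RBD snapshots kept on primary storage for one volume, so they can
# be pruned manually once they are no longer needed.
POOL="simplercloud-sg-01"
IMAGE="fc5dfa05-2431-4b42-804b-b2fb72e219d0"
if command -v rbd >/dev/null 2>&1; then
    rbd snap ls "$POOL/$IMAGE" || echo "rbd snap ls failed (no cluster access?)"
    # To delete one snapshot once it is no longer needed:
    # rbd snap rm "$POOL/$IMAGE@snapshot-name"
else
    echo "rbd CLI not installed on this host"
fi
echo "snapshot check complete"
```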

Looking forward to your reply, thank you.

Cheers.



On Tue, Oct 8, 2013 at 2:41 PM, Wido den Hollander <wi...@widodh.nl> wrote:

>
>
> On 10/08/2013 04:59 AM, Indra Pramana wrote:
>
>> Dear Wido and all,
>>
>> I performed some further tests last night:
>>
>> (1) CPU utilization of the KVM host while an RBD snapshot is running
>> still shoots up high even after I set the global setting
>> concurrent.snapshots.threshold.perhost to 2.
>>
>> (2) Most of the concurrent snapshot processes will fail, either stuck
>> in the "Creating" state or with a "CreatedOnPrimary" error message.
>>
>>
> Hmm, that is odd. It uses rados-java to call the RBD library to create the
> snapshot and afterwards it copies it to Secondary Storage.
>
> I'm leaving for the Ceph Days and the Build a Cloud Day afterwards in
> London now, so I won't be able to look at this the coming 2 days.
>
>
>> (3) I also have adjusted some other related global settings, such as
>> backup.snapshot.wait and job.expire.minutes, without any luck.
>>
>> Any advice on what causes the high CPU utilization is greatly
>> appreciated.
>>
>>
> You might want to set the Agent log to debug and see if the RBD snapshot
> was created; it should log that:
> https://git-wip-us.apache.org/repos/asf?p=cloudstack.git;a=blob;f=plugins/hypervisors/kvm/src/com/cloud/hypervisor/kvm/storage/KVMStorageProcessor.java;h=1b883519073acc7514b66857e080a464714c4324;hb=4.2#l1091
>
> "Attempting to create RBD snapshot"
>
> If that succeeds the problem lies with backing up the snapshot to
> Secondary Storage.
>
> Wido
>
>
>  Looking forward to your reply, thank you.
>>
>> Cheers.
>>
>>
>> On Mon, Oct 7, 2013 at 11:03 PM, Indra Pramana <in...@sg.or.id> wrote:
>>
>>  Dear all,
>>>
>>> I also found out that when the RBD snapshot is being run, the CPU
>>> utilisation on the KVM host will be shooting up very high, which might
>>> explain why the host becomes disconnected.
>>>
>>> top - 22:49:32 up 3 days, 19:31,  1 user,  load average: 7.85, 4.97, 3.47
>>> Tasks: 297 total,   3 running, 294 sleeping,   0 stopped,   0 zombie
>>> Cpu(s):  4.5%us,  1.2%sy,  0.0%ni, 94.1%id,  0.1%wa,  0.0%hi,  0.0%si,
>>> 0.0%st
>>> Mem:  264125244k total, 77203460k used, 186921784k free,   154888k
>>> buffers
>>> Swap:   545788k total,        0k used,   545788k free, 60677092k cached
>>>
>>>    PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>>> 18161 root      20   0 3871m  31m 8444 S  101  0.0 301:58.09 kvm
>>>   2790 root      20   0 43.5g 1.6g  19m S   97  0.7  45:52.42 jsvc
>>> 24544 root      20   0 4583m  31m 8364 S   97  0.0 425:29.48 kvm
>>>   6537 root      20   0     0    0    0 R   71  0.0   0:17.49 kworker/3:2
>>> 22546 root      20   0 6143m 2.0g 8452 S   26  0.8  55:14.07 kvm
>>>   4219 root      20   0 7671m 4.0g 8524 S    6  1.6 106:12.26 kvm
>>>   5989 root      20   0 43.2g 1.6g  232 D    6  0.6   0:08.13 jsvc
>>>   5993 root      20   0 43.3g 1.6g  224 D    6  0.6   0:08.36 jsvc
>>>
>>> Is it normal that when a snapshot is run on a VM on that host,
>>> the host's CPU utilisation is higher than usual? How can I limit the
>>> CPU resources used by the snapshot?
>>>
>>>
>>> Looking forward to your reply, thank you.
>>>
>>> Cheers.
>>>
>>>
>>>
>>> On Mon, Oct 7, 2013 at 7:18 PM, Indra Pramana <in...@sg.or.id> wrote:
>>>
>>>  Dear all,
>>>>
>>>> I did some tests on snapshots since it's now supported for my Ceph RBD
>>>> primary storage in CloudStack 4.2. When I ran the snapshot for a
>>>> particular VM instance earlier, I noticed that this caused the host
>>>> (where the VM is running) to become disconnected.
>>>>
>>>> Here's the excerpt from the agent.log:
>>>>
>>>> http://pastebin.com/dxVV7stu
>>>>
>>>> The management-server.log doesn't show much other than
>>>> detecting that the host was down and HA is being activated:
>>>>
>>>> http://pastebin.com/UeLiSm9K
>>>>
>>>> Can anyone advise what is causing the problem? So far only one user
>>>> has been doing the snapshotting and it has already caused issues to the
>>>> host; I can't imagine what would happen if multiple users tried to do
>>>> snapshotting at the same time.
>>>>
>>>> I read about snapshot job throttling which is described on the manual:
>>>>
>>>>
>>>> http://cloudstack.apache.org/docs/en-US/Apache_CloudStack/4.2.0/html/Admin_Guide/working-with-snapshots.html
>>>>
>>>> But I am not too sure whether this will help to resolve the problem,
>>>> since there is only one user performing snapshots and we already
>>>> encounter the problem.
>>>>
>>>> Can anyone advise how I can troubleshoot further and find a solution to
>>>> the problem?
>>>>
>>>> Looking forward to your reply, thank you.
>>>>
>>>> Cheers.
>>>>
>>>>
>>>
>>>
>>

Re: High CPU utilization on KVM hosts while doing RBD snapshot - was Re: snapshot caused host disconnected

Posted by Wido den Hollander <wi...@widodh.nl>.

On 10/08/2013 04:59 AM, Indra Pramana wrote:
> Dear Wido and all,
>
> I performed some further tests last night:
>
> (1) CPU utilization of the KVM host while an RBD snapshot is running
> still shoots up high even after I set the global setting
> concurrent.snapshots.threshold.perhost to 2.
>
> (2) Most of the concurrent snapshot processes will fail, either stuck
> in the "Creating" state or with a "CreatedOnPrimary" error message.
>

Hmm, that is odd. It uses rados-java to call the RBD library to create 
the snapshot and afterwards it copies it to Secondary Storage.
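
For anyone reproducing this by hand, the same two-step flow (snapshot, then copy to secondary) can be approximated with the rbd CLI instead of rados-java. This is a sketch only: the pool, image, and snapshot names and the destination directory are made up for illustration, not taken from the agent code.

```shell
# Step 1: create the snapshot on the Ceph cluster.
# Step 2: copy ("back up") the snapshot to a file on mounted secondary storage.
POOL="rbd"; IMAGE="foo"; SNAP="snapname"
DEST="/tmp/secondary-snapshots"   # stand-in for the secondary storage mount
mkdir -p "$DEST"
if command -v rbd >/dev/null 2>&1; then
    rbd snap create "$POOL/$IMAGE@$SNAP" || echo "snap create failed"
    rbd export "$POOL/$IMAGE@$SNAP" "$DEST/$SNAP.raw" || echo "export failed"
else
    echo "rbd CLI not available; showing intended commands only"
fi
echo "backup flow finished"
```

The export step is the part that reads the whole image off the cluster, which matches the observation that CPU climbs during the copy rather than during snapshot creation.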

I'm leaving for the Ceph Days and the Build a Cloud Day afterwards in
London now, so I won't be able to look at this for the coming 2 days.

> (3) I also have adjusted some other related global settings, such as
> backup.snapshot.wait and job.expire.minutes, without any luck.
>
> Any advice on what causes the high CPU utilization is greatly
> appreciated.
>

You might want to set the Agent log to debug and see if the RBD snapshot 
was created, it should log that: 
https://git-wip-us.apache.org/repos/asf?p=cloudstack.git;a=blob;f=plugins/hypervisors/kvm/src/com/cloud/hypervisor/kvm/storage/KVMStorageProcessor.java;h=1b883519073acc7514b66857e080a464714c4324;hb=4.2#l1091

"Attempting to create RBD snapshot"

If that succeeds the problem lies with backing up the snapshot to 
Secondary Storage.

Wido

> Looking forward to your reply, thank you.
>
> Cheers.
>
>
> On Mon, Oct 7, 2013 at 11:03 PM, Indra Pramana <in...@sg.or.id> wrote:
>
>> Dear all,
>>
>> I also found out that when the RBD snapshot is being run, the CPU
>> utilisation on the KVM host will be shooting up very high, which might
>> explain why the host becomes disconnected.
>>
>> top - 22:49:32 up 3 days, 19:31,  1 user,  load average: 7.85, 4.97, 3.47
>> Tasks: 297 total,   3 running, 294 sleeping,   0 stopped,   0 zombie
>> Cpu(s):  4.5%us,  1.2%sy,  0.0%ni, 94.1%id,  0.1%wa,  0.0%hi,  0.0%si,
>> 0.0%st
>> Mem:  264125244k total, 77203460k used, 186921784k free,   154888k buffers
>> Swap:   545788k total,        0k used,   545788k free, 60677092k cached
>>
>>    PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>> 18161 root      20   0 3871m  31m 8444 S  101  0.0 301:58.09 kvm
>>   2790 root      20   0 43.5g 1.6g  19m S   97  0.7  45:52.42 jsvc
>> 24544 root      20   0 4583m  31m 8364 S   97  0.0 425:29.48 kvm
>>   6537 root      20   0     0    0    0 R   71  0.0   0:17.49 kworker/3:2
>> 22546 root      20   0 6143m 2.0g 8452 S   26  0.8  55:14.07 kvm
>>   4219 root      20   0 7671m 4.0g 8524 S    6  1.6 106:12.26 kvm
>>   5989 root      20   0 43.2g 1.6g  232 D    6  0.6   0:08.13 jsvc
>>   5993 root      20   0 43.3g 1.6g  224 D    6  0.6   0:08.36 jsvc
>>
>> Is it normal that when a snapshot is run on a VM on that host,
>> the host's CPU utilisation is higher than usual? How can I limit the
>> CPU resources used by the snapshot?
>>
>>
>> Looking forward to your reply, thank you.
>>
>> Cheers.
>>
>>
>>
>> On Mon, Oct 7, 2013 at 7:18 PM, Indra Pramana <in...@sg.or.id> wrote:
>>
>>> Dear all,
>>>
>>> I did some tests on snapshots since it's now supported for my Ceph RBD
>>> primary storage in CloudStack 4.2. When I ran the snapshot for a particular
>>> VM instance earlier, I noticed that this caused the host (where the VM
>>> is running) to become disconnected.
>>>
>>> Here's the excerpt from the agent.log:
>>>
>>> http://pastebin.com/dxVV7stu
>>>
>>> The management-server.log doesn't show much other than
>>> detecting that the host was down and HA is being activated:
>>>
>>> http://pastebin.com/UeLiSm9K
>>>
>>> Can anyone advise what is causing the problem? So far only one user has
>>> been doing the snapshotting and it has already caused issues to the host;
>>> I can't imagine what would happen if multiple users tried to do
>>> snapshotting at the same time.
>>>
>>> I read about snapshot job throttling which is described on the manual:
>>>
>>>
>>> http://cloudstack.apache.org/docs/en-US/Apache_CloudStack/4.2.0/html/Admin_Guide/working-with-snapshots.html
>>>
>>> But I am not too sure whether this will help to resolve the problem,
>>> since there is only one user performing snapshots and we already
>>> encounter the problem.
>>>
>>> Can anyone advise how I can troubleshoot further and find a solution to
>>> the problem?
>>>
>>> Looking forward to your reply, thank you.
>>>
>>> Cheers.
>>>
>>
>>
>

RE: High CPU utilization on KVM hosts while doing RBD snapshot - was Re: snapshot caused host disconnected

Posted by Suresh Sadhu <Su...@citrix.com>.
Indra,

Are you seeing high CPU utilization only through CloudStack, or is it the same with Ceph directly? Please try the commands below and share your results after performing a snapshot from Ceph. This will help us isolate whether the problem really happens at the time of the snapshot, or during the file conversion happening in the background.

--> Create a volume, write some data and perform a snapshot

##
rbd --pool rbd snap create --snap snapname foo
rbd snap create rbd/foo@snapname
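A minimal sketch of such an isolation test might look like the following. This is an assumption-laden example, not a prescribed procedure: it assumes a pool named `rbd` with an existing test image `foo`, and that pidstat (from the sysstat package) is installed on the host. It snapshots the image directly with the rbd CLI, bypassing CloudStack, while sampling per-process CPU usage:

```shell
# Hypothetical isolation test: snapshot an RBD image directly while
# sampling CPU usage of kvm processes, with CloudStack out of the loop.
rbd snap create rbd/foo@cputest &
snap_pid=$!

# Sample per-process CPU usage once per second while the snapshot runs.
pidstat -u 1 -C kvm &
stat_pid=$!

wait "$snap_pid"
kill "$stat_pid"

# Clean up the test snapshot afterwards.
rbd snap rm rbd/foo@cputest
```

If CPU usage stays flat here but spikes when CloudStack drives the snapshot, the overhead is likely in the agent-side backup/conversion step rather than in the RBD snapshot itself.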

regards
sadhu



-----Original Message-----
From: Indra Pramana [mailto:indra@sg.or.id] 
Sent: 08 October 2013 08:29
To: dev@cloudstack.apache.org; users@cloudstack.apache.org
Cc: Wido den Hollander
Subject: High CPU utilization on KVM hosts while doing RBD snapshot - was Re: snapshot caused host disconnected

Dear Wido and all,

I performed some further tests last night:

(1) CPU utilization of the KVM host still shoots up high while an RBD snapshot is running, even after I set the global setting
concurrent.snapshots.threshold.perhost to 2.

(2) Most of the concurrent snapshot processes fail, either stuck in the "Creating" state or with a "CreatedOnPrimary" error message.

(3) I have also adjusted some other related global settings, such as backup.snapshot.wait and job.expire.minutes, without any luck.
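For reference, a global setting such as the threshold in (1) can be changed through CloudStack's updateConfiguration API, for example with CloudMonkey. This is a hedged sketch — the service name and restart requirement may differ per deployment, and some settings take effect only after a management-server restart:

```shell
# Sketch: lower the per-host concurrent snapshot limit via CloudMonkey.
cloudmonkey update configuration name=concurrent.snapshots.threshold.perhost value=1

# Many global settings are only picked up after a management-server restart.
service cloudstack-management restart
```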

Any advice on what causes the high CPU utilization is greatly appreciated.

Looking forward to your reply, thank you.

Cheers.


On Mon, Oct 7, 2013 at 11:03 PM, Indra Pramana <in...@sg.or.id> wrote:

> Dear all,
>
> I also found that when an RBD snapshot is being run, the CPU
> utilisation on the KVM host shoots up very high, which might
> explain why the host becomes disconnected.
>
> top - 22:49:32 up 3 days, 19:31,  1 user,  load average: 7.85, 4.97, 3.47
> Tasks: 297 total,   3 running, 294 sleeping,   0 stopped,   0 zombie
> Cpu(s):  4.5%us,  1.2%sy,  0.0%ni, 94.1%id,  0.1%wa,  0.0%hi,  0.0%si, 
> 0.0%st
> Mem:  264125244k total, 77203460k used, 186921784k free,   154888k buffers
> Swap:   545788k total,        0k used,   545788k free, 60677092k cached
>
>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
> 18161 root      20   0 3871m  31m 8444 S  101  0.0 301:58.09 kvm
>  2790 root      20   0 43.5g 1.6g  19m S   97  0.7  45:52.42 jsvc
> 24544 root      20   0 4583m  31m 8364 S   97  0.0 425:29.48 kvm
>  6537 root      20   0     0    0    0 R   71  0.0   0:17.49 kworker/3:2
> 22546 root      20   0 6143m 2.0g 8452 S   26  0.8  55:14.07 kvm
>  4219 root      20   0 7671m 4.0g 8524 S    6  1.6 106:12.26 kvm
>  5989 root      20   0 43.2g 1.6g  232 D    6  0.6   0:08.13 jsvc
>  5993 root      20   0 43.3g 1.6g  224 D    6  0.6   0:08.36 jsvc
>
> Is it normal for the host's CPU utilisation to be higher than usual
> while a snapshot is being run on a VM on that host? How can I limit
> the CPU resources used by the snapshot?
>
>
> Looking forward to your reply, thank you.
>
> Cheers.
>
>
>
> On Mon, Oct 7, 2013 at 7:18 PM, Indra Pramana <in...@sg.or.id> wrote:
>
>> Dear all,
>>
>> I did some tests on snapshots since it's now supported for my Ceph
>> RBD primary storage in CloudStack 4.2. When I ran a snapshot for a
>> particular VM instance earlier, I noticed that it caused the host
>> (where the VM runs) to become disconnected.
>>
>> Here's the excerpt from the agent.log:
>>
>> http://pastebin.com/dxVV7stu
>>
>> The management-server.log doesn't show much other than detecting
>> that the host was down and that HA was being activated:
>>
>> http://pastebin.com/UeLiSm9K
>>
>> Can anyone advise what is causing the problem? So far only one user
>> doing snapshots has caused issues on the host; I can't imagine what
>> would happen if multiple users tried to snapshot at the same time.
>>
>> I read about snapshot job throttling which is described on the manual:
>>
>>
>> http://cloudstack.apache.org/docs/en-US/Apache_CloudStack/4.2.0/html/Admin_Guide/working-with-snapshots.html
>>
>> But I am not sure whether this will resolve the problem, since we
>> already encounter it with only one user performing a snapshot.
>>
>> Can anyone advise how I can troubleshoot further and find a solution
>> to the problem?
>>
>> Looking forward to your reply, thank you.
>>
>> Cheers.
>>
>
>