Posted to dev@cloudstack.apache.org by Edison Su <Ed...@citrix.com> on 2013/06/04 01:17:18 UTC

[DISCUSS] NFS cache storage issue on object_store

Let's start a new thread about NFS cache storage issues on object_store.
First, I'll go through how NFS storage works on the master branch, then how it works on the object_store branch, and then let's talk about the "issues".

0.       Why do we need NFS secondary storage? NFS secondary storage is used as a place to store templates/snapshots etc.; it's zone wide, and it's supported by most hypervisors (except Hyper-V). NFS storage has existed in CloudStack since 1.x. With the rise of object storage such as S3/Swift, CloudStack added support for Swift in 3.x and for S3 in 4.0. You may wonder: if S3/Swift is used as the place to store templates/snapshots, why do we still need NFS secondary storage?

There are two reasons for that:

a.       CloudStack storage code is tightly coupled with NFS secondary storage, so when Swift/S3 support was added, it was easier to take a shortcut and leave NFS secondary storage as it was.

b.      Certain hypervisors, and certain storage-related operations, cannot operate directly on object storage.
Examples:

b.1 When backing up a snapshot (a snapshot taken on the XenServer hypervisor) from primary storage to S3 on XenServer

If there are snapshot chains on the volume, and we want to coalesce the snapshot chains into a new disk and then copy it to S3, we either coalesce the snapshot chains on primary storage, or on an extra storage repository (SR) supported by XenServer.

If we coalesce it on primary storage, we may blow up the primary storage, as the coalesced new disk may need a lot of space (consider that the new disk will contain all the content from the leaf snapshot all the way up to the base template), but the primary storage was not planned for this operation (the CloudStack mgt server is unaware of it, so it may think the primary storage still has enough space to create volumes).

XenServer doesn't have an API to coalesce snapshots directly to S3, so we have to use another storage that XenServer supports; that's why NFS storage is used during snapshot backup. So what we do is first call the XenServer API to coalesce the snapshot to NFS storage, then copy the newly created file into S3. This is what we do on both the master branch and the object_store branch.
                               b.2 When creating a volume from a snapshot, if the snapshot is stored on S3.
                                                 If the snapshot is a delta snapshot, we need to coalesce the chain into a new volume. We can't coalesce snapshots directly on S3, AFAIK, so we have to download the snapshot and its parents somewhere, then coalesce them with XenServer's tools. Again, there are two options: download all the snapshots into primary storage, or download them into NFS storage.
                                                If we download all the snapshots into primary storage directly from S3, then first we need to find a way to import snapshots from S3 into primary storage (if the primary storage is a block device, extra care is needed) and then coalesce them. If we go this way, we need to find a primary storage with enough space, and even worse, if the primary storage is not zone-wide, we may later need to copy the volume from one primary storage to another, which is time consuming.
                                                If we download all the snapshots into NFS storage from S3, we then coalesce them and copy the volume to primary storage. As the NFS storage is zone wide, you can copy the volume into whatever primary storage you like, without an extra copy. This is what we do on both the master branch and the object_store branch.
                              b.3 Some hypervisors, or some storages, do not support importing a template into primary storage directly from a URL. For example, if Ceph is used as primary storage, importing a template into RBD requires transforming a QCOW2 image into a RAW disk, and then into RBD format 2. To transform a QCOW2 image into a RAW disk you need an extra file system: either a local file system (this is what another stack does, which is not scalable to me), or an NFS storage (this is what can be done on both master and object_store). Or one could modify the hypervisor or storage to support importing a template from S3 into RBD directly. Here is the link (http://www.mail-archive.com/ceph-devel@vger.kernel.org/msg14411.html) that Wido posted.
                 Anyway, there are many combinations of hypervisors and storages: for some hypervisors with a zone-wide file-system-based storage (e.g. KVM + gluster/NFS as primary storage), you don't need extra NFS storage. Also, if you are using VMware or Hyper-V, which can import a template from a URL regardless of which storage you are using, you don't need extra NFS storage. But if you are using XenServer, you will need an NFS storage in order to create a volume from a delta snapshot, and if you are using KVM + Ceph, you may also need an NFS storage.
                Due to the above reasons, NFS cache storage is needed in certain cases if S3 is used as secondary storage. The combinations of hypervisors and storages are quite complicated, so whether to use cache storage should be decided case by case. But as long as CloudStack provides a framework that gives people the choice to enable/disable cache storage on their own, I think the framework is good enough.


1.       Then let's talk about how NFS storage works on the master branch, with or without S3.
If S3 is not used, here is how NFS storage is used:

1.1   Register a template/ISO: cloudstack downloads the template/ISO into NFS storage.

1.2   Backup snapshot: cloudstack sends a command to the xenserver hypervisor, which issues a vdi.copy command to copy the snapshot to NFS; for KVM, "cp" or "qemu-img convert" is used directly to copy the snapshot into NFS storage.

1.3   Create volume from snapshot: if the snapshot is a delta snapshot, coalesce the chain on NFS storage, then vdi.copy it from NFS to primary storage. For KVM, use "cp" or "qemu-img convert" to copy the snapshot from NFS storage to primary storage.


               If S3 is used:

1.4   Register a template/ISO: download the template/ISO into NFS storage first; then a background thread uploads the template/ISO from NFS storage into S3 periodically. The template being in the Ready state only means the template is stored on NFS storage; the admin doesn't know whether the template is stored on S3 or not. Even worse, if there are multiple zones, cloudstack will copy the template from one zone-wide NFS storage into another NFS storage in another zone, even though a region-wide S3 is already available. As the template is not uploaded to S3 directly when it is registered, it takes several copies to spread the template region wide.

1.5   Backup snapshot: cloudstack sends a command to the xenserver hypervisor to copy the snapshot to NFS storage, then immediately uploads the snapshot from NFS storage into S3. The snapshot being in the BackedUp state means not only that the snapshot is on NFS storage, but also that it's stored on S3.

1.6   Create volume from snapshot: download the snapshot and its parent snapshots from S3 into NFS storage, then coalesce and vdi.copy the volume from NFS to primary storage.



2.       Then let's talk about how it works on object_store:
If S3 is not used, there is ZERO change from the master branch. NFS secondary storage works on object_store exactly as it did before.
If S3 is used, and NFS cache storage is also used (which is the default):
   2.1 Register a template/ISO: the template/ISO is uploaded directly to S3; there is no extra copy to NFS storage. When the template is in the "Ready" state, it means the template is stored on S3. It implies that the template is immediately available in the region as soon as it's in the Ready state. And the admin clearly knows the status of the template on S3: what percentage has been uploaded, whether it failed or succeeded. Also, if registering the template fails for some reason, the admin can issue the register template command again. I would say the change in how a template is registered into S3 is far better than what we did on the master branch.
   2.2 Backup snapshot: it's the same as on the master branch: send a command to the xenserver host, copy the snapshot into NFS, then upload it to S3.
   2.3 Create volume from snapshot: it's the same as on the master branch: download the snapshot and its parent snapshots from S3 into NFS, then copy it from NFS to primary storage.
From the above few typical usage cases, you can understand how S3 and NFS cache storage are used, and what the difference is between the object_store branch and the master branch: basically, we only changed the way a template is registered, nothing else.
If S3 is used, and no NFS cache storage is used (it's possible, depending on which data motion strategy is used):
    2.4 Register a template/ISO: it's the same as 2.1.
    2.5 Backup snapshot: export the snapshot from primary storage into S3 directly.
    2.6 Create volume from snapshot: download the snapshots from S3 into primary storage directly, then coalesce them and create the volume.

          I hope the above explanation tells the truth about how the system works on object_store, and clears up the misconceptions/misunderstandings about the object_store branch. Even though the change is huge, we still maintain backward compatibility. If you don't want to use S3 and only want to use existing NFS storage, that's definitely OK; it works the same as before. If you want to use S3, we provide a better S3 implementation when registering templates/ISOs. If you want to use S3 without NFS storage, that's also definitely OK; the framework is flexible enough to accommodate different solutions.

Ok, let's talk about the NFS cache storage issues.
The NFS cache storage issue has been discussed back and forth in several threads. All in all, NFS cache storage is only one of the three usage cases supported by the object_store branch; it's not the case that if it has an issue, nothing works.
Sections 2.2 and 2.3 above show how NFS cache storage is involved in snapshot-related operations. The complaints that there is no aging policy and no capacity planner for NFS cache storage apply when downloading a snapshot from S3 into NFS, copying a snapshot from primary storage into NFS, or downloading a template from S3 into NFS. Yes, it's an issue: the NFS cache storage can fill up if there is no capacity planner and no aging-out policy. But can it be fixed? Is it a design issue?
Let's talk about the code. Here is the code related to NFS cache storage; there is not much, only one class depends on NFS cache storage: https://git-wip-us.apache.org/repos/asf?p=cloudstack.git;a=blob;f=engine/storage/datamotion/src/org/apache/cloudstack/storage/motion/AncientDataMotionStrategy.java;h=a01d2d30139f70ad8c907b6d6bc9759d47dcc2d6;hb=refs/heads/object_store
Take copyVolumeFromSnapshot as an example, which is called when creating a volume from a snapshot: it first calls cacheSnapshotChain, which calls cacheMgr.createCacheObject to download the snapshot into NFS cache storage. StorageCacheManagerImpl->createCacheObject is the only place that creates objects on NFS cache storage; the code is at https://git-wip-us.apache.org/repos/asf?p=cloudstack.git;a=blob;f=engine/storage/cache/src/org/apache/cloudstack/storage/cache/manager/StorageCacheManagerImpl.java;h=cb5ea106fed3e5d2135dca7d98aede13effcf7d9;hb=refs/heads/object_store
In createCacheObject, it first finds a cache storage, in case multiple cache storages are available in a scope:
DataStore cacheStore = this.getCacheStorage(scope);
getCacheStorage calls StorageCacheAllocator to find a proper NFS cache storage. So StorageCacheAllocator is the place where the NFS cache storage is chosen based on certain criteria; the current implementation only randomly chooses one of them, but we can add a new allocator algorithm based on capacity, etc.
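As an illustration of what such a capacity-aware allocator could look like, here is a minimal Java sketch; the CacheStoreInfo type and its fields are stand-ins for whatever the real allocator SPI exposes, not the actual interface on the branch:

    import java.util.Comparator;
    import java.util.List;
    import java.util.Optional;

    // Stand-in for a cache (staging) storage candidate; field names are illustrative only.
    class CacheStoreInfo {
        final long id;
        final long totalBytes;
        final long usedBytes;
        CacheStoreInfo(long id, long totalBytes, long usedBytes) {
            this.id = id;
            this.totalBytes = totalBytes;
            this.usedBytes = usedBytes;
        }
        long freeBytes() { return totalBytes - usedBytes; }
    }

    class CapacityAwareCacheAllocator {
        // Instead of picking a store at random, pick the candidate with the most free
        // space that can still hold the object being cached.
        Optional<CacheStoreInfo> allocate(List<CacheStoreInfo> candidates, long requiredBytes) {
            return candidates.stream()
                    .filter(s -> s.freeBytes() >= requiredBytes)
                    .max(Comparator.comparingLong(CacheStoreInfo::freeBytes));
        }
    }

Any other policy (first fit, round robin among stores under a usage threshold) would plug into the same spot.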
Regarding capacity reservation, there is already a table called op_host_capacity which has an entry for NFS secondary storage; we can reuse this kind of entry to store capacity information about NFS cache storages (such as total size, available/used capacity, etc.). Then, on every call to createCacheObject, we can call StorageCacheAllocator to find a proper NFS storage on a first-fit basis and increase the used capacity in the op_host_capacity table. If creating the cache object fails, we return the capacity to op_host_capacity.
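A rough sketch of that bookkeeping, assuming plain JDBC against op_host_capacity; the used_capacity/total_capacity column names are my assumption about the schema, and host_id stands in for the cache store's row:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;

    class CacheCapacityReservation {
        // Atomically reserve 'bytes' on the cache store's op_host_capacity row.
        // The WHERE clause acts as a compare-and-set: the update only succeeds if
        // enough free capacity remains, so concurrent reservations cannot oversubscribe.
        boolean reserve(Connection conn, long storeId, long bytes) throws SQLException {
            String sql = "UPDATE op_host_capacity SET used_capacity = used_capacity + ? "
                       + "WHERE host_id = ? AND used_capacity + ? <= total_capacity";
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                ps.setLong(1, bytes);
                ps.setLong(2, storeId);
                ps.setLong(3, bytes);
                return ps.executeUpdate() == 1;   // 0 rows updated means not enough space
            }
        }

        // Give the capacity back if createCacheObject fails (or when an object is aged out).
        void release(Connection conn, long storeId, long bytes) throws SQLException {
            String sql = "UPDATE op_host_capacity SET used_capacity = used_capacity - ? WHERE host_id = ?";
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                ps.setLong(1, bytes);
                ps.setLong(2, storeId);
                ps.executeUpdate();
            }
        }
    }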

Regarding the aging-out policy, we can start a background thread on the mgt server which scans all the objects created on NFS cache storage (the tables are snapshot_store_ref, template_store_ref and volume_store_ref). Each entry of these tables has a column called "updated"; every time the object's state changes, the "updated" column is updated as well. When does the object's state change? Every time the object is used in some context (such as copying the snapshot on NFS cache storage to somewhere else), the object's state changes accordingly, e.g. to "Copying", meaning the object is being copied to some place. This is exactly the information we need to implement an LRU algorithm.
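A sketch of such a background task, shown for snapshot_store_ref only; the state and column names follow the description above and should be treated as assumptions:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;

    class CacheAgingTask implements Runnable {
        private final Connection conn;
        private final int idleDays;

        CacheAgingTask(Connection conn, int idleDays) {
            this.conn = conn;
            this.idleDays = idleDays;
        }

        @Override
        public void run() {
            // Find cache entries that have not been touched (any state change bumps
            // the 'updated' column) for 'idleDays' days.
            String sql = "SELECT id FROM snapshot_store_ref "
                       + "WHERE state = 'Ready' AND updated < DATE_SUB(NOW(), INTERVAL ? DAY)";
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                ps.setInt(1, idleDays);
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        evict(rs.getLong("id"));
                    }
                }
            } catch (SQLException e) {
                // A failed scan is not fatal; the next run will retry.
                e.printStackTrace();
            }
        }

        // Placeholder: flip the entry to 'Destroying' and ask the SSVM to delete the file.
        private void evict(long entryId) {
            System.out.println("would evict cache entry " + entryId);
        }
    }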

What do you guys think about the fix? If you have a better solution, please let me know.



Re: [DISCUSS] NFS cache storage issue on object_store

Posted by John Burwell <jb...@basho.com>.
Edison,

Please see my comments in-line below.

Thanks,
-John

On Jun 6, 2013, at 6:43 PM, Edison Su <Ed...@citrix.com> wrote:

> 
> 
>> -----Original Message-----
>> From: John Burwell [mailto:jburwell@basho.com]
>> Sent: Thursday, June 06, 2013 7:47 AM
>> To: dev@cloudstack.apache.org
>> Subject: Re: [DISCUSS] NFS cache storage issue on object_store
>> 
>> Edison,
>> 
>> Please my comments in-line below.
>> 
>> Thanks,
>> -John
>> 
>> On Jun 5, 2013, at 6:55 PM, Edison Su <Ed...@citrix.com> wrote:
>> 
>>> 
>>> 
>>>> -----Original Message-----
>>>> From: John Burwell [mailto:jburwell@basho.com]
>>>> Sent: Wednesday, June 05, 2013 1:04 PM
>>>> To: dev@cloudstack.apache.org
>>>> Subject: Re: [DISCUSS] NFS cache storage issue on object_store
>>>> 
>>>> Edison,
>>>> 
>>>> You have provided some great information below which helps greatly to
>>>> understand the role of the "NFS cache" mechanism.  To summarize, this
>>>> mechanism is only currently required for Xen snapshot operations
>>>> driven by Xen's coalescing operations.  Is my understanding correct?
>>>> Just out of
>>> 
>>> I think Ceph may still need "NFS cache", for example, during delta snapshot
>> backup:
>>> http://ceph.com/dev-notes/incremental-snapshots-with-rbd/
>>> You need to create a delta snapshot into a file, then upload the file into S3
>>> 
>>> For KVM, if the snapshot is taken on qcow2, then need to copy the
>> snapshot into a file system, then backup it to S3.
>>> 
>>> Another usage case for "NFS cache " is to cache template stored on S3, if
>> there is no zone-wide primary storage. We need to download template from
>> S3 into every primary storage, if there is no cache, each download will take a
>> while: comparing download template directly from S3(if the S3 is region wide)
>> with download from a zone wide "cache" storage, I would say, the download
>> from zone wide cache storage should be faster than from region wide S3. If
>> there is no zone wide primary storage, then we will download the template
>> from S3 several times, which is quite time consuming.
>>> 
>>> 
>>> There may have other places to use "NFS cache", but the point is as
>>> long as mgt server can be decoupled from this "cache" storage, then we
>> can decide when/how to use cache storage based on different kind of
>> hypervisor/storage combinations in the future.
>> 
>> I think we would do well to re-orient the way we think about roles and
>> requirements.  Ceph doesn't need a file system to perform a delta snapshot
>> operation.  Xen, KVM, and/or VMWare need access to a file system to
> 
> For the Ceph delta snapshot case, it's Ceph that has the requirement of a file system to perform a delta snapshot (http://ceph.com/docs/next/man/8/rbd/):
> 
> export-diff [image-name] [dest-path] [--from-snap snapname]
> Exports an incremental diff for an image to dest path (use - for stdout). If an initial snapshot is specified, only changes since that snapshot are included; otherwise, any regions of the image that contain data are included. The end snapshot is specified using the standard --snap option or @snap syntax (see below). The image diff format includes metadata about image size changes, and the start and end snapshots. It efficiently represents discarded or 'zero' regions of the image.
> 
> The dest-path is either a file or stdout; if using stdout, a lot of memory is needed. If using the hypervisor's local file system, the local file system may not have enough space to store the delta diff.

I apologize for failing to read more closely -- I mistakenly assumed you were referring to hypervisor snapshots.  To my mind, if a local file system is needed by a storage driver to perform an operation then it should be encapsulated within the driver's scope.  The storage layer should provide a suitable interface for the driver to acquire/release a reservation to the staging/temporary area if it needs it.

For Ceph specifically, stdout can be pushed through a BufferedOutputStream and written straight to the object store -- skipping the file system.  With this approach, we should be able to keep the memory required a fixed size and "pump" it out to the object store.  Ideally, we would define the interfaces to provide InputStreams and OutputStreams -- creating the potential for the copy operation to be implemented in the orchestration code.
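A minimal sketch of that pump, assuming the object-store side is handed to us as an ordinary OutputStream (for example a multipart-upload wrapper) and using the rbd export-diff invocation from the man page quoted above:

    import java.io.InputStream;
    import java.io.OutputStream;

    class RbdDiffPump {
        // Stream 'rbd export-diff ... -' straight into the object store, using a fixed-size
        // buffer so memory use stays constant regardless of the diff size. How the
        // objectStoreSink is obtained (e.g. an S3 multipart-upload wrapper) is left open here.
        static void pump(String image, String fromSnap, OutputStream objectStoreSink) throws Exception {
            Process p = new ProcessBuilder("rbd", "export-diff", "--from-snap", fromSnap, image, "-")
                    .start();
            byte[] buf = new byte[4 * 1024 * 1024];          // 4 MB pump buffer
            try (InputStream in = p.getInputStream()) {
                int n;
                while ((n = in.read(buf)) != -1) {
                    objectStoreSink.write(buf, 0, n);
                }
            }
            objectStoreSink.flush();
            if (p.waitFor() != 0) {
                throw new RuntimeException("rbd export-diff failed");
            }
        }
    }

Memory stays bounded by the buffer size no matter how large the diff is.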

> 
>> perform these operations.  The hypervisor plugin should request a
>> reservation of x size as a file handle from the Storage subsystem.  The Ceph
>> driver implements this request by using a staging area + transfer operation.
>> This approach encapsulates the operation/rules around the staging area from
>> clients, protects against concurrent requests flooding a resource, and allows
>> hypervisor-specific behavior/rules to encapsulated in the appropriate plugin.
>> 
>>> 
>>>> curiosity, is their a Xen expert on the list who can provide a
>>>> high-level description of the coalescing operation -- in particular,
>>>> the way it interacts with storage?  I have Googled a bit, and found very
>> little information about it.
>>>> Has the object_store branch been tested with VMWare and KVM?  If so,
>>>> what operations on these hypervisors have been tested?
>>> 
>>> Both vmware and KVM is tested, but without S3 support. Haven't have
>> time to take a look at how to use S3 in both hypervisors yet.
>>> For example, we should take a look at how to import a template from url
>> into vmware data store, thus, we can eliminate "NFS cache" during template
>> import.
>> 
>> Given the release extension and the impact of these tests on the
>> implementation, we need to test S3 with VMWare and KVM pre-merge.
> 
> I would like to hand over the implementation of S3 (using S3 directly, without an NFS staging area) on both VMware and KVM to the community, either in the next release or after the merge.
> The reason is simple: we need to get the mgt server refactor done first; the hypervisor-side implementation or optimization can be done after the mgt server refactor. I think what we are doing in the mgt server refactor paves the way for this kind of optimization on the hypervisor side.

This begs a larger question for me -- why is the implementation hypervisor specific?  Naively, it seems that fitting the current hypervisors for the new storage architecture would bring along this feature for nearly free.  I remain concerned that we have not adequately decoupled the Hypervisor and Storage layers.  As I have stated a few (thousand) times now, I am focused on breaking the circular dependency between the Hypervisor and Storage layers to avoid this type of feature stratification.

> 
>> 
>>> 
>>>> 
>>>> In reading through the description below, my operation concerns
>>>> remain regarding potential race conditions and resource exhaustion.
>>>> Also, in reading through the description, I think we should find a
>>>> new name for this mechanism.  As Chip has previous mentioned, a cache
>>>> implies the following
>>>> characteristics:
>>>> 
>>>>   1. Optional: Systems can operate without caches just more slowly.
>>>> However, with this mechanism, snapshots on Xen will not function.
>>> 
>>> 
>>> I agree on this one.
>>> 
>>>>   2. Volatility: Caches are backed by durable, non-volitale storage.
>>>> Therefore, if the cache's data is lost, it can be rebuilt from the
>>>> backing store and no data will be permanently lost from the system.
>>>> However, this mechanism contains snapshots in-transit to an object
>>>> store.  If the data contained in this "cache" were lost before its
>>>> transfer to the object store completed, the snapshot data would be lost.
>>> 
>>> It's the same thing for file cache on Linux file system. If the file cache is not
>> flushed into disk, while the machine lost power, then the data on the file
>> cache is lost.
>>> When we backup the snapshot from primary storage to S3, the snapshot is
>> copied to "Nfs cache", then immediately, copied from "Nfs cache" into S3. If
>> the snapshot on "Nfs cache" is lost, then the snapshot backup is failed. User
>> can issue another backup snapshot command in this case.
>>> So I don't think it's an issue.
>> 
>> The window of opportunity for data loss from a file system sync is much
>> narrower for the Linux filesystem that for this staging area.  Furthermore,
>> that risk can be largely (if not completely) mitigated with battery-backup
>> hardware and/or conservative NFS settings.
>> 
>> For this staging area, the object store may be unreachable for an extended
>> period of time (minutes, hours).  There are no cache flush settings or
>> hardware solutions when it becomes unavailable.  If the data is lost from the
>> staging area, it will be gone.  I think it is one of the largest issues with this
>> approach, and we must be careful to ensure that data can not be lost before
>> it is transferred out.
> 
> I agree. It's not that I want to use a staging area; it's a limitation of the hypervisor or the storage, which can't directly transfer data in/out of S3 for some operations.
> I think we agree on the limitations and issues with the staging area, but that's the current reality.
> If we want to remove the staging area entirely, we need more resources to look at what can be done for each hypervisor and each storage. We can't finish everything in just one month.
> If other people are willing to help us in this area, I'll appreciate it.

I apologize if I haven't clearly expressed my recognition that we currently can't avoid the staging area in some circumstances.  I want to ensure that we implement it in a robust manner that avoids introducing instability into implementations using object storage.

> 
>> 
>>> 
>>>> 
>>>> In order to set expectations with users and better frame our design
>>>> conversation, I think it would be appropriate this mechanism as a
>>>> staging,
>>> 
>>> Ok, seems cache is confusing people, we can use other term, or document
>> it clearly, what's the role of the storage.
>>> Yes, it's just a temporary file system, which can be used to store some
>> temporary files.
>>> 
>>>> scratch, or temporary area.  I also recommend removing the notion of
>>>> NFS its name as NFS is initial implementation of this mechanism.  In
>>>> the future, I can see a desire for local filesystem, RBD, and iSCSI
>> implementations of it.
>>> 
>>> Agree, any storage can be used as "Cache" storage. If you take a look at
>> storagemanagerImpl->createCacheStore, it's nothing related to NFS.
>>> 
>>>> 
>>>> In terms of solving the potential race conditions and resource
>>>> exhaustion issues, I don't think an LRU approach will be sufficient
>>>> because the least recently used resource may be still be in use by
>>>> the system.  I think we should look to a reservation model with
>>>> reference counting where files are deleted when once no processes are
>>>> accessing them.  The following is a
>>>> (handwave-handwave) overview of the process I think would meet these
>>>> requirements:
>>>> 
>>>> 	1. Request a reservation for the maximum size of the file(s) that
>>>> will be processed in the staging area.
>>>> 		- If the file is already in the staging area, increase its
>>>> reference count
>>>> 		- If the reservation can not be fulfilled, we can either drop
>> the
>>>> process in a retry queue or reject it.
>>>> 	2. Perform work and transfer file(s) to/from the object store
>>>> 	3. Release the file(s) -- decrementing the reference count.  When
>>>> the reference count is <= 0, delete the file(s) from the staging area
>>> 
>>> I assume the reference count is stored in memory and inside SSVM?
>>> The reference count may not work properly, in case of multiple secondary
>> storage VMs and multiple mgt servers. And there may have a lot of places
>> other than SSVM can directly use the cached object.
>>> If we store the reference count on file system, then need to take a
>> lock(such as nfs lock, or lock file)to update, while the lock can be failed to
>> release, due to all kind of reasons(such as network).
>> 
>> We could implement reference counting in a number of ways.  The first
>> would be increment a value in the database before command submission to
>> the SSVM, and decrement as part of answer processing.  We could evaluate
> 
> I agree, we can add a ref count column in template/volume/snapshot_store_ref, which can track how many readers are using the cached object.
> 
>> using a distributed framework such as Hazelcast (http://www.hazelcast.com)
>> which provides a distributed countdown latch
>> (http://www.hazelcast.com/docs/1.9.4/javadoc/com/hazelcast/core/ICount
>> DownLatch.html) across the SSVMs.  We need to avoid POSIX-style file
> 
> 
> Good to know.
> 
>> system locks because they are not consistently implemented/available (e.g.
>> OCFS2).
>> 
>> My first brush thoughts on it would be to use a database table in 4.2, and
>> evaluate adopting a something like Hazelcast in 4.3.  Personally, I would like
>> to see us move away from relying on relational database semantic to
>> implement distributed data structures (counters, locks, etc).  However, given
>> the time pressures, I don't think we have the time properly evaluate the
>> impact of adopting a more general purpose distributed framework in 4.2.
> 
> I agree.
> 
>> 
>> From a code perspective, I think it would behove us to implement a more
>> functional approach to command execution in order to ensure reference
>> counting, error handling, resource management are handled in a consistent
>> manner.  I implemented such an approach in
>> com.cloud.utils.db.GlobalLock#executeWithLock where locking around a
>> particular operation is managed separately form the actual operation being
>> performed.
> 
> I'll take a look at your implementation.
> 
>> 
>>> 
>>> I thought about it yesterday, about how to implement LRU. Originally,
>>> I though, we could eliminate race condition and track who is using objects
>> stored on cache storage by using state machine For example, whenever mgt
>> server wants to use the cached object, mgt server can change the state for
>> the cached object to "Copying"(there is a DB entry for each cached object),
>> after the copy is finished, then change the state into "Ready", and also
>> update "updated" column. It will eliminate the race condition, as only one
>> thread can access the cached object, and change its state. But the problem of
>> this way, is that, there are cases that multiple reader threads may want to
>> read the cached object at the same time: e.g. copy the same cached
>> template to multiple primary storages at the same time.
>>> 
>>> In order to accommodate multiple readers, I am trying to add a new db
>> table to track the users of  the cached object.
>>> The follow will be like the following:
>>> 1. mgt server wants to use the cached object, first, need to check the state
>> of the cached object, the state must be in ready state.
>>> 2. mgt server writes a db entry into DB, the entry will contain, the id of
>> cached object, the id of cached storage, the issued time. The db entry will
>> also contain a state: the state can be initial/processing/finished/failed. Mgt
>> server needs to set the state as "processing".
>>> 3. mgt server finishes the operation related the cached object, then mark
>> state of above db entry as "finished",  also update the time column of above
>> entry.
>>> 4. the above db entries will be removed if the state is not in "processing"
>> for a while(let's say one week?), or if the entry is in the "processing" state for
>> a while(let's say one day). In this way, mgt server can easily know which
>> cached object is used or not used recently, by take a look this db table.
>>> 5. If mgt server find a cached object is not used(there is no db entry in the
>> above table) for a while(let's say one week), then change the state of the
>> cached object into "destroying", then send command to ssvm to destroy the
>> object.
>>> 6. There is small window, that mgt server is changing the state of cached
>> object into "destroying"(there is no db entry is in "processing" state in the
>> above table,), while another thread is trying to copying(as the cached object
>> state is still in ready state), both DB operations will success, we can hold a DB
>> lock on the cached object entry, before both DB opeations.
>>> 
>>> How do you think?
>> 
>> The issue remains that is the least recently used (really accessed) object can
>> still be in use by a running process.  One example that pops to mind is a
>> popular, large template that has a set of longish running processes creating
>> from it.  As I described above, I think you can change issued time to a
>> reference count, and add logic to step 3 to decrement/check the object
>> count.  With the proper transaction semantics, we provide sufficient
>> consistency guarantees around a reference count.
> 
> Agree. I only need to track how many readers are currently using the cached object, so a ref cnt is enough. I don't even need to create a new db table to track the ref cnt; adding a new refcnt column on template/snapshot/volume_store_ref is good enough. Every time the ref cnt is updated, the "updated" column gets updated as well, so based on the ref cnt column and the updated column, the mgt server will know whether any other users are using the cached object and when the cached object was last used, and can then implement an LRU reclaim algorithm.

I think the safest approach for now is to simply delete the file from the staging area when its reference count drops to 0.  This approach will likely incur some additional transfer, but it is the simplest path to ensure the least amount of resource consumption.  If we see performance issues, we can evaluate adding an LRU algorithm to hold objects in the staging area longer.
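A sketch of the release path under that policy, assuming the ref_cnt column on template/snapshot/volume_store_ref that Edison proposes (shown for template_store_ref only):

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;

    class StagingRefCount {
        // Decrement the reference count of a staged object and delete the backing file
        // as soon as nobody references it any more. 'ref_cnt' is the proposed column,
        // not an existing one; a real implementation would run the decrement and the
        // zero check in one transaction under the row lock to close the small race here.
        static void release(Connection conn, long entryId) throws SQLException {
            try (PreparedStatement dec = conn.prepareStatement(
                    "UPDATE template_store_ref SET ref_cnt = ref_cnt - 1, updated = NOW() "
                  + "WHERE id = ? AND ref_cnt > 0")) {
                dec.setLong(1, entryId);
                dec.executeUpdate();
            }
            try (PreparedStatement check = conn.prepareStatement(
                    "SELECT ref_cnt FROM template_store_ref WHERE id = ?")) {
                check.setLong(1, entryId);
                try (ResultSet rs = check.executeQuery()) {
                    if (rs.next() && rs.getLong(1) == 0) {
                        deleteFromStaging(entryId);   // placeholder for the SSVM delete command
                    }
                }
            }
        }

        private static void deleteFromStaging(long entryId) {
            System.out.println("would delete staged object " + entryId);
        }
    }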

> 
>> 
>> The other part that we must accommodate is resource reservation.  Client
>> need to declare the anticipated size of their use before starting an operation.
>> The Storage needs to track the amount of space committed vs. used, and fail
>> fast when it is clear that the system will not have the resources available to
>> fulfill a request.  For 4.2, I think we don;t have the time implement a robust
>> queueing/best efforts facility.  For 4.2, I think a checked exception indicating
>> temporary resource unavailability will be sufficient for clients to determine
>> the best course of recovery action (i.e. error out or retry).
> 
> Resource reservation is something that hasn't been done well in cloudstack for a long time. There is no proper resource reservation for storage-related operations, so the storage can easily get used up if there are concurrent volume creation operations, as there is no lock at the mgt server to check/update storage capacity.
> What I am trying to implement for resource reservation is:
> 1. Each storage (primary/secondary, or staging area) has a db entry in op_host_capacity, which contains the used/allocated/total size of the storage.
> 2. Each allocation operation (there is a common entry point: datastore->create/delete) needs to update the above db entry in an atomic way:
> either hold a DB row lock and then update, or implement a CompareAndSet method, so that in case of concurrent storage create/delete operations the capacity is updated properly.
> 3. Before each capacity update, if used/total is beyond a certain threshold, fail the operation.

Hopefully, this work will lead to a more generic resource reservation system within CloudStack.  I think a resource_reservation table with a foreign key to the storage entity, a size, a creation timestamp, a last accessed timestamp, and an id (UUID) will suffice.  We will also need a reservation_resource_lock table with a row per DataStore.  The reservation process would perform the following steps:

	1. Acquire a row-level lock from reservation_resource_lock table for the DataStore
	2. Sum the reservations for the device and determine if enough space exists
	3. If enough space exists, insert a row in the resource_reservation with the size, resource id, and UUID of the reservation
	4. Release the row-level lock on reservation_resource_lock table for the DataStore

Reservation release would follow a similar approach without the summation -- just a delete of the reservation by UUID.  As a backstop, we also need a reaper thread to kill reservations based on a TTL from the last accessed timestamp.
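A sketch of those four steps in JDBC, using the table and column names proposed above (they do not exist in the schema yet):

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.util.UUID;

    class ReservationManager {
        // Lock the DataStore's row, sum existing reservations, insert a new one if it
        // fits, then commit (which releases the row lock). Returns the reservation UUID,
        // or null if there is not enough space (caller errors out or retries).
        static String reserve(Connection conn, long storeId, long storeCapacity, long bytes)
                throws SQLException {
            boolean oldAutoCommit = conn.getAutoCommit();
            conn.setAutoCommit(false);
            try {
                try (PreparedStatement lock = conn.prepareStatement(
                        "SELECT store_id FROM reservation_resource_lock WHERE store_id = ? FOR UPDATE")) {
                    lock.setLong(1, storeId);
                    lock.executeQuery();                       // step 1: row-level lock
                }
                long reserved = 0;
                try (PreparedStatement sum = conn.prepareStatement(
                        "SELECT COALESCE(SUM(size), 0) FROM resource_reservation WHERE store_id = ?")) {
                    sum.setLong(1, storeId);
                    try (ResultSet rs = sum.executeQuery()) {
                        if (rs.next()) reserved = rs.getLong(1);   // step 2: space already promised
                    }
                }
                if (reserved + bytes > storeCapacity) {
                    conn.rollback();
                    return null;
                }
                String uuid = UUID.randomUUID().toString();
                try (PreparedStatement ins = conn.prepareStatement(
                        "INSERT INTO resource_reservation (uuid, store_id, size, created) VALUES (?, ?, ?, NOW())")) {
                    ins.setString(1, uuid);
                    ins.setLong(2, storeId);
                    ins.setLong(3, bytes);
                    ins.executeUpdate();                       // step 3: record the reservation
                }
                conn.commit();                                 // step 4: commit releases the row lock
                return uuid;
            } catch (SQLException e) {
                conn.rollback();
                throw e;
            } finally {
                conn.setAutoCommit(oldAutoCommit);
            }
        }
    }

Releasing a reservation is then just a DELETE by UUID, and the reaper thread deletes rows whose last accessed timestamp is older than the TTL.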

> 
> There are some known issues with the resource reservation:
> 1. The size of certain objects is unknown at the time of the resource reservation, such as the template size (we may need to call httpclient on the mgt server to get the size of the template, as the template has not yet been downloaded into secondary storage in the register-template case), or the snapshot size (when copying a snapshot from primary storage to the NFS staging area, the mgt server doesn't know the size of the snapshot before issuing the copy command, so it doesn't know how large a reservation to make).

For templates, we will need to know the size to transfer to the object store.  For a snapshot, we can start with a reservation for the total size of the Volume being snapshotted.  The reservation does not need to be precise.  It must be large enough to fit the results of the operation.  Therefore, if a Volume is defined to be 10GB in size, but the snapshot only occupies 500MB of space, then we reserve 10GB.  We are assured that the snapshot operation will not fail due to a lack of disk space.  On the downside, we may crowd out other operations, but I would rather block other operations than have a race to fill the disk.

> 2. Due to issue 1 above, the capacity db table can get out of sync with the actual storage usage. No matter how carefully we code the mgt server, the capacity info in the DB can drift from the actual physical capacity; we need to sync it with the info returned by GetStorageStatsCommand.
> 3. Storage over-provisioning: currently only NFS storage can be over-provisioned, but I think this should be decided by each storage provider.

Agreed.  The DataStore should be queried for available free space, which in turn should be implemented by the driver.  Thinking through it, the result should be a Long where a null value means, essentially, infinite space available, since most object stores don't really have the notion of free space ...
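For instance, the driver-side query could be as simple as the following (names are illustrative, not an existing interface):

    interface StorageDriverCapacity {
        // Free space on the backing store, in bytes. A null return means "effectively
        // unbounded" (e.g. most object stores), so the reservation check always passes.
        Long getAvailableBytes();
    }

    class ReservationCheck {
        static boolean fits(StorageDriverCapacity driver, long requestedBytes) {
            Long free = driver.getAvailableBytes();
            return free == null || free >= requestedBytes;
        }
    }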

> 
> I'll implement a simple resource reservation at first.
> 
>> 
>>> 
>>>> 
>>>> We would also likely want to consider a TTL to purge files after a
>>>> configurable period of inactivity as a backstop against crashed
>>>> processes failing to properly decrementing the reference count.  In
>>>> this model, we will either defer or reject work if resources are not
>> available, and we properly bound resources.
>>> 
>>> Yes, it should be taken into consideration for all the time consuming
>> operations.
>>> 
>>>> 
>>>> Finally, in terms of decoupling the decision to use of this mechanism
>>>> by hypervisor plugins from the storage subsystem, I think we should
>>>> expose methods on the secondary storage services that allow clients
>>>> to explicitly request or create resources using files (i.e.
>>>> java.io.File) instead of streams (e.g. createXXX(File) or
>>>> readXXXAsFile).  These interfaces would provide the storage subsystem
>> with the hint that the client requires file access to the
>>>> request resource.   For object store plugins, this hint would be used to
>> wrap
>>>> the resource in an object that would transfer in and/out of the staging
>> area.
>>>> 
>>>> Thoughts?
>>>> -John
>>>> 
>>>> On Jun 3, 2013, at 7:17 PM, Edison Su <Ed...@citrix.com> wrote:
>>>> 
>>>>> Let's start a new thread about NFS cache storage issues on object_store.
>>>>> First, I'll go through how NFS storage works on master branch, then
>>>>> how it
>>>> works on object_store branch, then let's talk about the "issues".
>>>>> 
>>>>> 0.       Why we need NFS secondary storage? Nfs secondary storage is
>> used
>>>> as a place to store templates/snapshots etc, it's zone wide, and it's
>>>> widely supported by most of hypervisors(except HyperV). NFS storage
>>>> exists in CloudStack since 1.x. With the rising of object storage,
>>>> like S3/Swift, CloudStack adds the support of Swift in 3.x, and S3 in
>>>> 4.0. You may wonder, if S3/Swift is used as the place to store
>>>> templates/snapshots, then why we still need NFS secondary storage?
>>>>> 
>>>>> There are two reasons for that:
>>>>> 
>>>>> a.       CloudStack storage code is tightly coupled with NFS secondary
>> storage,
>>>> so when adding Swift/S3 support, it's likely to take shortcut, leave
>>>> NFS secondary storage as it is.
>>>>> 
>>>>> b.      Certain hypervisors, and certain storage related operations, can not
>>>> directly operate on object storage.
>>>>> Examples:
>>>>> 
>>>>> b.1 When backing up snapshot(the snapshot taken from xenserver
>>>>> hypervisor) from primary storage to S3 in xenserver
>>>>> 
>>>>> If there are snapshot chains on the volume, and if we want to
>>>>> coalesce the
>>>> snapshot chains into a new disk, then copy it to S3, we either,
>>>> coalesce the snapshot chains on primary storage, or on an extra
>>>> storage repository (SR) that supported by Xenserver.
>>>>> 
>>>>> If we coalesce it on primary storage, then may blow up the primary
>>>>> storage,
>>>> as the coalesced new disk may need a lot of space(thinking about, the
>>>> new disk will contain all the content in from leaf snapshot, all the
>>>> way up to base template), but the primary storage is not planned to
>>>> this operation(cloudstack mgt server is unaware of this operation,
>>>> the mgt server may think the primary storage still has enough space to
>> create volumes).
>>>>> 
>>>>> While xenserver doesn't have API to coalesce snapshots directly to
>>>>> S3, so
>>>> we have to use other storages that supported by Xenserver, that's why
>>>> the NFS storage is used during snapshot backup. So what we did is
>>>> that first call xenserver api to coalesce the snapshot to NFS
>>>> storage, then copy the newly created file into S3. This is what we
>>>> did on both master branch and object_store branch.
>>>>>                             b.2 When create volume from snapshot if
>>>>> the snapshot is
>>>> stored on S3.
>>>>>                                               If the snapshot is a
>>>>> delta snapshot, we need to
>>>> coalesce them into a new volume. We can't coalesce snapshots directly
>>>> on S3, AFAIK, so we have to download the snapshot and its parents
>>>> into somewhere, then coalesce them with xenserver's tools. Again,
>>>> there are two options, one is to download all the snapshots into
>>>> primary storage, or download them into NFS storage:
>>>>>                                              If we download all the
>>>>> snapshots into primary
>>>> storage directly from S3, then first we need find a way import
>>>> snapshot from
>>>> S3 into Primary storage(if primary storage is a block device, then
>>>> need extra
>>>> care) and then coalesce them. If we go this way, need to find a
>>>> primary storage with enough space, and even worse, if the primary
>>>> storage is not zone-wide, then later on, we may need to copy the
>>>> volume from one primary storage to another, which is time consuming.
>>>>>                                              If we download all the
>>>>> snapshots into NFS storage
>>>> from S3, then coalesce them, and then copy the volume to primary
>> storage.
>>>> As the NFS storage is zone wide, so, you can copy the volume into
>>>> whatever primary storage, without extra copy. This is what we did in
>>>> master branch and object_store branch.
>>>>>                            b.3, some hypervisors, or some storages
>>>>> do not support
>>>> directly import template into primary storage from a URL. For
>>>> example, if Ceph is used as primary storage, when import a template
>>>> into RBD, need transform a Qcow2 image into RAW disk, then into RBD
>>>> format 2. In order to transform an image from Qcow2 image into RAW
>>>> disk, you need extra file system, either a local file system(this is
>>>> what other stack does, which is not scalable to me), or a NFS
>>>> storage(this is what can be done on both master and object_store). Or
>>>> one can modify hypervisor or storage to support directly import
>>>> template from S3 into RBD. Here is the link(http://www.mail-
>>>> archive.com/ceph-devel@vger.kernel.org/msg14411.html), that Wido
>> posted.
>>>>>               Anyway, there are so many combination of hypervisors
>>>>> and
>>>> storages: for some hypervisors with zone wide file system based
>> storage(e.g.
>>>> KVM + gluster/NFS as primary storage), you don't need extra nfs storage.
>>>> Also if you are using VMware or HyperV, which can import template
>>>> from a URL, regardless which storage your are using, then you don't
>>>> need extra NFS storage. While if you are using xenserver, in order to
>>>> create volume from delta snapshot, you will need a NFS storage, or if
>>>> you are using KVM + Ceph, you also may need a NFS storage.
>>>>>              Due to above reasons, NFS cache storage is need in
>>>>> certain cases if
>>>> S3 is used as secondary storage. The combination of hypervisors and
>>>> storages are quite complicated, to use cache storage or not, should be
>> case by case.
>>>> But as long as cloudstack provides a framework, gives people the
>>>> choice to enable/disable cache storage on their own, then I think the
>>>> framework is good enough.
>>>>> 
>>>>> 
>>>>> 1.       Then let's talk about how NFS storage works on master branch,
>> with
>>>> or without S3.
>>>>> If S3 is not used, here is the how NFS storage is used:
>>>>> 
>>>>> 1.1   Register a template/ISO: cloudstack downloads the template/ISO
>> into
>>>> NFS storage.
>>>>> 
>>>>> 1.2   Backup snapshot: cloudstack sends a command to xenserver
>>>> hypervisor, issue vdi.copy command copy the snapshot to NFS, for kvm,
>>>> directly use "cp" or "qemu-img convert" to copy the snapshot into NFS
>>>> storage.
>>>>> 
>>>>> 1.3   Create volume from snapshot: If the snapshot is a delta snapshot,
>>>> coalesce them on NFS storage, then vdi.copy it from NFS to primary
>> storage.
>>>> If it's KVM, use "cp" or "qemu-img convert" to copy the snapshot from
>>>> NFS storage to primary storage.
>>>>> 
>>>>> 
>>>>>             If S3 is used:
>>>>> 
>>>>> 1.4   Register a template/ISO: download the template/ISO into NFS
>> storage
>>>> first, then there is background thread, which can upload the
>>>> template/ISO from NFS storage into S3 regularly. The template is in
>>>> Ready state, only means the template is stored on NFS storage, but
>>>> admin doesn't know the template is stored on the S3 or not. Even
>>>> worse, if there are multiple zones, cloudstack will copy the template
>>>> from one zone wide NFS storage into another NFS storage in another
>>>> zone, while there is already has a region wide
>>>> S3 available. As the template is not directly uploaded to S3 when
>>>> registering a template, it will take several copy in order to spread
>>>> the template into a region wide.
>>>>> 
>>>>> 1.5   Backup snapshot: cloudstack sends a command to xenserver
>>>> hypervisor, copy the snapshot to NFS storage, then immediately,
>>>> upload the snapshot from NFS storage into S3. The snapshot is in
>>>> Backedup state, not only means the snapshot is in  NFS storage, but also
>> means it's stored on S3.
>>>>> 
>>>>> 1.6   Create volume from snapshot: download the snapshot  and it's
>> parent
>>>> snapshots from S3 into NFS storage, then coalesce and vdi.copy the
>>>> volume from NFS to primary storage.
>>>>> 
>>>>> 
>>>>> 
>>>>> 2.       Then let's talk about how it works on object_store:
>>>>> If S3 is not used, there is ZERO change from master branch. How the
>>>>> NFS
>>>> secondary storage works before, is the same on object_store.
>>>>> If S3 is used, and NFS cache storage used also(which is by default):
>>>>> 2.1 Register a template/ISO: the template/ISO are directly uploaded
>>>>> to S3,
>>>> there is no extra copy to NFS storage. When the template is in "Ready"
>> state,
>>>> means the template is stored on S3.                  It implies that: the template
>> is
>>>> immediately available in the region as soon as it's in Ready State.
>>>> And admin can clearly knows the status of template on S3, what's
>>>> percentage of the uploading, is it failed or succeed? Also if
>>>> register template failed for some reason, admin can issue the
>>>> register template command again. I would say the change of how to
>>>> register template into S3 is far better than what we did on master branch.
>>>>> 2.2 Backup snapshot: it's same as master branch, sends a command to
>>>> xenserver host, copy the snapshot into NFS, then upload to S3.
>>>>> 2.3 Create volume from snapshot: it's the same as master branch,
>>>> download snapshot and it's parent snaphots from S3 into NFS, then
>>>> copy it from NFS to primary storage.
>>>>> From above few typical usage cases, you may understand how S3 and
>>>>> NFS
>>>> cache storage is used, and what's difference between object_store
>>>> branch and master branch: basically, we only change the way how to
>>>> register a template, nothing else.
>>>>> If S3 is used, and no NFS cache storage is used(it's possible,
>>>>> depends on
>>>> which datamotion strategy is used):
>>>>>  2.4 Register a template/ISO: it's the same as 2.1
>>>>>  2.5 Backup snapshot: export the snapshot from primary storage into
>>>>> S3
>>>> directly
>>>>>  2.6 Create volume from snapshot: download snapshots from S3 into
>>>> primary storage directly, then coalesce and create volume from it.
>>>>> 
>>>>>        Hope above explanation will tell the truth how the system
>>>>> works on
>>>> object_store, and clarify the misconception/misunderstanding  about
>>>> object_store branch. Even the change is huge, we still maintain the
>>>> back compatibility. If you don't want to use S3, only want to
>>>> existing NFS storage, it's definitely OK, it works the same as
>>>> before. If you want to use S3, we provide a better S3 implementation
>>>> when registering template/ISO. If you want to use S3 without NFS
>>>> storage, that's also definitely OK,  the framework is quite flexible to
>> accommodate different solutions.
>>>>> 
>>>>> Ok, let's talk  about the NFS storage cache issues.
>>>>> The issue about NFS cache storage is discussed in several threads,
>>>>> back and
>>>> forth. All in all, the NFs cache storage is only one usage case out
>>>> of three usage cases supported by object_store branch. It's not
>>>> something that if it has issue, then everything doesn't work.
>>>>> In above 2.2 and 2.3, it shows how the NFS cache storage is involved
>>>>> during
>>>> snapshot related operations. The complains about there is no aging
>>>> policy, no capacity planner for NFS cache storage, is happened when
>>>> download a snapshot from S3 into NFS, or copy a snapshot from primary
>>>> storage into NFS, or download template from S3 into NFS. Yes, it's an
>>>> issue, the NFS cache storage can be used out, if there is no capacity
>>>> planner, and no aging out policy. But can it be fixed? Is it a design issue?
>>>>> Let's talk the code: Here is the code related to NFS cache storage,
>>>>> not much, only one class depends on NFS cache storage:
>>>>> https://git-wip-us.apache.org/repos/asf?p=cloudstack.git;a=blob;f=en
>>>>> gi
>>>>> 
>>>> 
>> ne/storage/datamotion/src/org/apache/cloudstack/storage/motion/Ancien
>>>> t
>>>>> 
>>>> 
>> DataMotionStrategy.java;h=a01d2d30139f70ad8c907b6d6bc9759d47dcc2d6;h
>>>> b=
>>>>> refs/heads/object_store Take copyVolumeFromSnapshot as example,
>>>> which
>>>>> will be called when create Volume from snapshot, if first calls
>>>>> cacheSnapshotChain, which will call cacheMgr.createCacheObject to
>>>>> download the snapshot into NFs cache storage.
>>>>> StorageCacheManagerImpl-> createCacheObject is the only place to
>>>>> create objects on NFs cache storage, the code is at
>>>>> https://git-wip-us.apache.org/repos/asf?p=cloudstack.git;a=blob;f=en
>>>>> gi
>>>>> 
>>>> 
>> ne/storage/cache/src/org/apache/cloudstack/storage/cache/manager/Stor
>>>> a
>>>>> 
>>>> 
>> geCacheManagerImpl.java;h=cb5ea106fed3e5d2135dca7d98aede13effcf7d9;
>>>> hb=
>>>>> refs/heads/object_store In createCacheObject, it will first find out
>>>>> a cache storage, in case there are multiple cache storages available in a
>> scope:
>>>>> DataStore cacheStore = this.getCacheStorage(scope); getCacheStorage
>>>>> will call StorageCacheAllocator to find out a proper NFS cache
>>>>> storage. So
>>>> StorageCacheAllocator is the place to choose NFS cache storage based
>>>> on certain criteria, the current implementation only randomly choose
>>>> one of them, we can add a new allocator algorithm, based on capacity etc,
>> etc.
>>>>> Regarding capacity reservation, there is already a table, called
>>>> op_host_capacity which has entry for NFS secondary storage, we can
>>>> reuse this entry to store capacity information about NFS cache
>>>> storages(such as, total size, available/used capacity etc). So when
>>>> every call createCacheObject, we can call StorageCacheAllocator to
>>>> find out a proper NFS storage based on first fit criteria, then
>>>> increase used capacity in op_host_capacity table. If the create cache
>> object failed, return the capacity to op_host_capacity.
>>>>> 
>>>>> Regarding the aging out policy, we can start a background thread on
>>>>> mgt
>>>> server, which will scan all the objects created on NFS cache
>>>> storage(the tables called: snapshot_store_ref, template_store_ref,
>>>> volume_store_ref), each entry of these tables has a column called:
>>>> updated, every time, when the object's state is changed, the "updated"
>> column will be got updated also.
>>>> When the object's state is changed? Every time, when the object is
>>>> used in some contexts(such as copy the snapshot on NFS cache storage
>>>> into somewhere), the object's state will be changed  accordingly,
>>>> such as "Copying", means the object is being copied to some place,
>>>> which is exactly the information we need to implement LRU algorithm.
>>>>> 
>>>>> How do you guys think about the fix? If you have better solution,
>>>>> please let
>>>> me know.
>>>>> 
>>>>> 
>>> 
> 


RE: [DISCUSS] NFS cache storage issue on object_store

Posted by Edison Su <Ed...@citrix.com>.

> -----Original Message-----
> From: John Burwell [mailto:jburwell@basho.com]
> Sent: Thursday, June 06, 2013 7:47 AM
> To: dev@cloudstack.apache.org
> Subject: Re: [DISCUSS] NFS cache storage issue on object_store
> 
> Edison,
> 
> Please my comments in-line below.
> 
> Thanks,
> -John
> 
> On Jun 5, 2013, at 6:55 PM, Edison Su <Ed...@citrix.com> wrote:
> 
> >
> >
> >> -----Original Message-----
> >> From: John Burwell [mailto:jburwell@basho.com]
> >> Sent: Wednesday, June 05, 2013 1:04 PM
> >> To: dev@cloudstack.apache.org
> >> Subject: Re: [DISCUSS] NFS cache storage issue on object_store
> >>
> >> Edison,
> >>
> >> You have provided some great information below which helps greatly to
> >> understand the role of the "NFS cache" mechanism.  To summarize, this
> >> mechanism is only currently required for Xen snapshot operations
> >> driven by Xen's coalescing operations.  Is my understanding correct?
> >> Just out of
> >
> > I think Ceph may still need "NFS cache", for example, during delta snapshot
> backup:
> > http://ceph.com/dev-notes/incremental-snapshots-with-rbd/
> > You need to create a delta snapshot into a file, then upload the file into S3.
> >
> > For KVM, if the snapshot is taken on qcow2, then need to copy the
> snapshot into a file system, then backup it to S3.
> >
> > Another usage case for "NFS cache " is to cache template stored on S3, if
> there is no zone-wide primary storage. We need to download template from
> S3 into every primary storage, if there is no cache, each download will take a
> while: comparing download template directly from S3(if the S3 is region wide)
> with download from a zone wide "cache" storage, I would say, the download
> from zone wide cache storage should be faster than from region wide S3. If
> there is no zone wide primary storage, then we will download the template
> from S3 several times, which is quite time consuming.
> >
> >
> > There may have other places to use "NFS cache", but the point is as
> > long as mgt server can be decoupled from this "cache" storage, then we
> can decide when/how to use cache storage based on different kind of
> hypervisor/storage combinations in the future.
> 
> I think we would do well to re-orient the way we think about roles and
> requirements.  Ceph doesn't need a file system to perform a delta snapshot
> operation.  Xen, KVM, and/or VMWare need access to a file system to

For the Ceph delta snapshot case, it's Ceph that has the requirement of a file system to perform a delta snapshot (http://ceph.com/docs/next/man/8/rbd/):

export-diff [image-name] [dest-path] [--from-snap snapname]
Exports an incremental diff for an image to dest path (use - for stdout). If an initial snapshot is specified, only changes since that snapshot are included; otherwise, any regions of the image that contain data are included. The end snapshot is specified using the standard --snap option or @snap syntax (see below). The image diff format includes metadata about image size changes, and the start and end snapshots. It efficiently represents discarded or 'zero' regions of the image.

The dest-path is either a file or stdout; if using stdout, a lot of memory is needed. If using the hypervisor's local file system, the local file system may not have enough space to store the delta diff.

> perform these operations.  The hypervisor plugin should request a
> reservation of x size as a file handle from the Storage subsystem.  The Ceph
> driver implements this request by using a staging area + transfer operation.
> This approach encapsulates the operation/rules around the staging area from
> clients, protects against concurrent requests flooding a resource, and allows
> hypervisor-specific behavior/rules to encapsulated in the appropriate plugin.
> 
> >
> >> curiosity, is their a Xen expert on the list who can provide a
> >> high-level description of the coalescing operation -- in particular,
> >> the way it interacts with storage?  I have Googled a bit, and found very
> little information about it.
> >> Has the object_store branch been tested with VMWare and KVM?  If so,
> >> what operations on these hypervisors have been tested?
> >
> > Both vmware and KVM is tested, but without S3 support. Haven't have
> time to take a look at how to use S3 in both hypervisors yet.
> > For example, we should take a look at how to import a template from url
> into vmware data store, thus, we can eliminate "NFS cache" during template
> import.
> 
> Given the release extension and the impact of these tests on the
> implementation, we need to test S3 with VMWare and KVM pre-merge.

I would like to hand over the implementation of S3 (using S3 directly, without an NFS staging area) on both VMware and KVM to the community, either in the next release or after the merge.
The reason is simple: we need to get the mgt server refactor done first; the hypervisor-side implementation or optimization can be done after the mgt server refactor. I think what we are doing in the mgt server refactor paves the way for this kind of optimization on the hypervisor side.

> 
> >
> >>
> >> In reading through the description below, my operation concerns
> >> remain regarding potential race conditions and resource exhaustion.
> >> Also, in reading through the description, I think we should find a
> >> new name for this mechanism.  As Chip has previous mentioned, a cache
> >> implies the following
> >> characteristics:
> >>
> >>    1. Optional: Systems can operate without caches just more slowly.
> >> However, with this mechanism, snapshots on Xen will not function.
> >
> >
> > I agree on this one.
> >
> >>    2. Volatility: Caches are backed by durable, non-volitale storage.
> >> Therefore, if the cache's data is lost, it can be rebuilt from the
> >> backing store and no data will be permanently lost from the system.
> >> However, this mechanism contains snapshots in-transit to an object
> >> store.  If the data contained in this "cache" were lost before its
> >> transfer to the object store completed, the snapshot data would be lost.
> >
> > It's the same thing for file cache on Linux file system. If the file cache is not
> flushed into disk, while the machine lost power, then the data on the file
> cache is lost.
> > When we backup the snapshot from primary storage to S3, the snapshot is
> copied to "Nfs cache", then immediately, copied from "Nfs cache" into S3. If
> the snapshot on "Nfs cache" is lost, then the snapshot backup is failed. User
> can issue another backup snapshot command in this case.
> > So I don't think it's an issue.
> 
> The window of opportunity for data loss from a file system sync is much
> narrower for the Linux filesystem that for this staging area.  Furthermore,
> that risk can be largely (if not completely) mitigated with battery-backup
> hardware and/or conservative NFS settings.
> 
> For this staging area, the object store may be unreachable for an extended
> period of time (minutes, hours).  There are no cache flush settings or
> hardware solutions when it becomes unavailable.  If the data is lost from the
> staging area, it will be gone.  I think it is one of the largest issues with this
> approach, and we must be careful to ensure that data can not be lost before
> it is transferred out.

I agree. It's not that I want to use a staging area; it's a limitation of the hypervisors and storages, which cannot transfer data directly in/out of S3 for some operations.
I think we agree on the limitations and issues of the staging area, but that's the current reality.
If we want to remove the staging area entirely, we need more resources to look at what we can do for each hypervisor and each storage. We can't finish everything in just one month.
If other people are willing to help us in this area, I'll appreciate it.

> 
> >
> >>
> >> In order to set expectations with users and better frame our design
> >> conversation, I think it would be appropriate this mechanism as a
> >> staging,
> >
> > Ok, seems cache is confusing people, we can use other term, or document
> it clearly, what's the role of the storage.
> > Yes, it's just a temporary file system, which can be used to store some
> temporary files.
> >
> >> scratch, or temporary area.  I also recommend removing the notion of
> >> NFS its name as NFS is initial implementation of this mechanism.  In
> >> the future, I can see a desire for local filesystem, RBD, and iSCSI
> implementations of it.
> >
> > Agree, any storage can be used as "Cache" storage. If you take a look at
> storagemanagerImpl->createCacheStore, it's nothing related to NFS.
> >
> >>
> >> In terms of solving the potential race conditions and resource
> >> exhaustion issues, I don't think an LRU approach will be sufficient
> >> because the least recently used resource may be still be in use by
> >> the system.  I think we should look to a reservation model with
> >> reference counting where files are deleted when once no processes are
> >> accessing them.  The following is a
> >> (handwave-handwave) overview of the process I think would meet these
> >> requirements:
> >>
> >> 	1. Request a reservation for the maximum size of the file(s) that
> >> will be processed in the staging area.
> >> 		- If the file is already in the staging area, increase its
> >> reference count
> >> 		- If the reservation can not be fulfilled, we can either drop
> the
> >> process in a retry queue or reject it.
> >> 	2. Perform work and transfer file(s) to/from the object store
> >> 	3. Release the file(s) -- decrementing the reference count.  When
> >> the reference count is <= 0, delete the file(s) from the staging area
> >
> > I assume the reference count is stored in memory and inside SSVM?
> > The reference count may not work properly, in case of multiple secondary
> storage VMs and multiple mgt servers. And there may have a lot of places
> other than SSVM can directly use the cached object.
> > If we store the reference count on file system, then need to take a
> lock(such as nfs lock, or lock file)to update, while the lock can be failed to
> release, due to all kind of reasons(such as network).
> 
> We could implement reference counting in a number of ways.  The first
> would be increment a value in the database before command submission to
> the SSVM, and decrement as part of answer processing.  We could evaluate

I agree, we can add a ref count column to template/volume/snapshot_store_ref, which can track how many readers of the cached object there are.
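
Something along these lines is what I have in mind for the ref count update (the table and column names are illustrative only, and the real code would go through the DAO layer rather than raw JDBC):

    // Sketch: atomically acquire/release a refcnt column on template_store_ref.
    import java.sql.Connection;
    import java.sql.PreparedStatement;

    public class CacheRefCountSketch {
        // Called before handing the cached object to a reader (e.g. before command submission to the SSVM).
        public static void acquire(Connection conn, long storeRefId) throws Exception {
            try (PreparedStatement s = conn.prepareStatement(
                    "UPDATE template_store_ref SET ref_cnt = ref_cnt + 1, updated = NOW() WHERE id = ?")) {
                s.setLong(1, storeRefId);
                s.executeUpdate();
            }
        }

        // Called from answer processing, whether the copy succeeded or failed.
        public static void release(Connection conn, long storeRefId) throws Exception {
            try (PreparedStatement s = conn.prepareStatement(
                    "UPDATE template_store_ref SET ref_cnt = ref_cnt - 1, updated = NOW() WHERE id = ? AND ref_cnt > 0")) {
                s.setLong(1, storeRefId);
                s.executeUpdate();
            }
        }
    }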

> using a distributed framework such as Hazelcast (http://www.hazelcast.com)
> which provides a distributed countdown latch
> (http://www.hazelcast.com/docs/1.9.4/javadoc/com/hazelcast/core/ICount
> DownLatch.html) across the SSVMs.  We need to avoid POSIX-style file


Good to know.

> system locks because they are not consistently implemented/available (e.g.
> OCFS2).
> 
> My first brush thoughts on it would be to use a database table in 4.2, and
> evaluate adopting a something like Hazelcast in 4.3.  Personally, I would like
> to see us move away from relying on relational database semantic to
> implement distributed data structures (counters, locks, etc).  However, given
> the time pressures, I don't think we have the time properly evaluate the
> impact of adopting a more general purpose distributed framework in 4.2.

I agree.

> 
> From a code perspective, I think it would behove us to implement a more
> functional approach to command execution in order to ensure reference
> counting, error handling, resource management are handled in a consistent
> manner.  I implemented such an approach in
> com.cloud.utils.db.GlobalLock#executeWithLock where locking around a
> particular operation is managed separately form the actual operation being
> performed.

I'll take a look at your implementation.
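
For reference, the general shape of that pattern as I understand it (this is only my own illustration, not the actual GlobalLock#executeWithLock signature):

    // Sketch of a functional "execute with lock" helper: the locking concern is
    // handled in one place, and the operation is passed in as a callable.
    import java.util.concurrent.Callable;
    import java.util.concurrent.locks.ReentrantLock;

    public class WithLockSketch {
        private static final ReentrantLock LOCK = new ReentrantLock();

        public static <T> T executeWithLock(Callable<T> operation) throws Exception {
            LOCK.lock();                 // in CloudStack this would be a db-backed/distributed lock
            try {
                return operation.call(); // the actual work, e.g. the refcnt update above
            } finally {
                LOCK.unlock();           // released even if the operation throws
            }
        }
    }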

> 
> >
> > I thought about it yesterday, about how to implement LRU. Originally,
> > I though, we could eliminate race condition and track who is using objects
> stored on cache storage by using state machine For example, whenever mgt
> server wants to use the cached object, mgt server can change the state for
> the cached object to "Copying"(there is a DB entry for each cached object),
> after the copy is finished, then change the state into "Ready", and also
> update "updated" column. It will eliminate the race condition, as only one
> thread can access the cached object, and change its state. But the problem of
> this way, is that, there are cases that multiple reader threads may want to
> read the cached object at the same time: e.g. copy the same cached
> template to multiple primary storages at the same time.
> >
> > In order to accommodate multiple readers, I am trying to add a new db
> table to track the users of  the cached object.
> > The follow will be like the following:
> > 1. mgt server wants to use the cached object, first, need to check the state
> of the cached object, the state must be in ready state.
> > 2. mgt server writes a db entry into DB, the entry will contain, the id of
> cached object, the id of cached storage, the issued time. The db entry will
> also contain a state: the state can be initial/processing/finished/failed. Mgt
> server needs to set the state as "processing".
> > 3. mgt server finishes the operation related the cached object, then mark
> state of above db entry as "finished",  also update the time column of above
> entry.
> > 4. the above db entries will be removed if the state is not in "processing"
> for a while(let's say one week?), or if the entry is in the "processing" state for
> a while(let's say one day). In this way, mgt server can easily know which
> cached object is used or not used recently, by take a look this db table.
> > 5. If mgt server find a cached object is not used(there is no db entry in the
> above table) for a while(let's say one week), then change the state of the
> cached object into "destroying", then send command to ssvm to destroy the
> object.
> > 6. There is small window, that mgt server is changing the state of cached
> object into "destroying"(there is no db entry is in "processing" state in the
> above table,), while another thread is trying to copying(as the cached object
> state is still in ready state), both DB operations will success, we can hold a DB
> lock on the cached object entry, before both DB opeations.
> >
> > How do you think?
> 
> The issue remains that is the least recently used (really accessed) object can
> still be in use by a running process.  One example that pops to mind is a
> popular, large template that has a set of longish running processes creating
> from it.  As I described above, I think you can change issued time to a
> reference count, and add logic to step 3 to decrement/check the object
> count.  With the proper transaction semantics, we provide sufficient
> consistency guarantees around a reference count.

Agree. I only need to track how many readers are currently using the cached object, so a ref count is enough. I don't even need to create a new db table to track it; adding a new refcnt column to template/snapshot/volume_store_ref is good enough. Every time the ref count is updated, the "updated" column is updated as well, so based on the refcnt and updated columns the mgt server knows whether any other users are still using the cached object and when it was last used, and can implement an LRU reclaim algorithm on top of that.
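
The reclaim thread on the mgt server could then be as simple as something like this (table/column names, the state values, and the one-week threshold are placeholders; the MySQL-specific DATE_SUB is just for illustration):

    // Sketch: periodic scan that picks cache entries with no readers that have
    // not been touched for a week, and marks them for destruction.
    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    public class CacheReclaimSketch {
        public static void reclaim(Connection conn) throws Exception {
            String sql = "SELECT id FROM template_store_ref "
                       + "WHERE ref_cnt = 0 AND state = 'Ready' "
                       + "AND updated < DATE_SUB(NOW(), INTERVAL 7 DAY)";
            try (PreparedStatement s = conn.prepareStatement(sql);
                 ResultSet rs = s.executeQuery()) {
                while (rs.next()) {
                    // transition Ready -> Destroying, then send a delete command to the
                    // SSVM to remove the file from the staging storage
                    markDestroying(conn, rs.getLong(1));
                }
            }
        }

        private static void markDestroying(Connection conn, long id) throws Exception {
            try (PreparedStatement s = conn.prepareStatement(
                    "UPDATE template_store_ref SET state = 'Destroying' WHERE id = ? AND ref_cnt = 0")) {
                s.setLong(1, id);
                s.executeUpdate();
            }
        }
    }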

> 
> The other part that we must accommodate is resource reservation.  Client
> need to declare the anticipated size of their use before starting an operation.
> The Storage needs to track the amount of space committed vs. used, and fail
> fast when it is clear that the system will not have the resources available to
> fulfill a request.  For 4.2, I think we don;t have the time implement a robust
> queueing/best efforts facility.  For 4.2, I think a checked exception indicating
> temporary resource unavailability will be sufficient for clients to determine
> the best course of recovery action (i.e. error out or retry).

Resource reservation is something that hasn't been done well in CloudStack for a long time. There is no proper resource reservation for storage-related operations; it's likely that storage will get used up when there are concurrent volume creation operations, as there is no lock at the mgt server to check/update storage capacity.
What I am trying to implement for resource reservation is:
1. Each storage (primary/secondary, or staging area) has a db entry in op_host_capacity, which contains the used/allocated/total size of that storage.
2. Each allocation operation (there is a common entry point: datastore->create/delete) needs to update the above db entry atomically:
either hold a DB row lock and then update, or implement a compare-and-set update (see the sketch below), so that the capacity is updated properly under concurrent storage create/delete operations.
3. Before each capacity update, if used/total would go beyond a certain threshold, fail the request.

There are some known issues with the resource reservation:
1. The size of certain objects is unknown at reservation time, such as the template size (we may need to issue an HTTP request from the mgt server to get the size of the template, since in the register-template case it has not yet been downloaded into secondary storage), or the snapshot size (when copying a snapshot from primary storage to the NFS staging area, the mgt server doesn't know the snapshot size before issuing the copy command, so it doesn't know how much to reserve).
2. Due to issue 1, the capacity db table can get out of sync with actual storage usage. No matter how carefully the mgt server code is written, the capacity info in the DB can drift from the actual physical capacity, so it needs to be synced with the info returned by GetStorageStatsCommand.
3. Storage over-provisioning: currently only NFS storage can over-provision, but I think this should be decided by each storage provider.

I'll implement a simple resource reservation first.
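
The compare-and-set capacity update mentioned in step 2 above could look roughly like this (column names and the 0.9 threshold are placeholders; a failed update would surface as the checked "resource temporarily unavailable" exception John suggested):

    // Sketch: reserve capacity on a staging store with a single conditional
    // UPDATE (compare-and-set); 0 rows updated means the reservation failed.
    import java.sql.Connection;
    import java.sql.PreparedStatement;

    public class CapacityReservationSketch {
        public static boolean reserve(Connection conn, long storeId, long bytes) throws Exception {
            String sql = "UPDATE op_host_capacity "
                       + "SET used_capacity = used_capacity + ? "
                       + "WHERE host_id = ? AND used_capacity + ? <= total_capacity * 0.9";
            try (PreparedStatement s = conn.prepareStatement(sql)) {
                s.setLong(1, bytes);
                s.setLong(2, storeId);
                s.setLong(3, bytes);
                return s.executeUpdate() == 1;  // false => not enough capacity: fail fast or retry
            }
        }

        public static void release(Connection conn, long storeId, long bytes) throws Exception {
            try (PreparedStatement s = conn.prepareStatement(
                    "UPDATE op_host_capacity SET used_capacity = used_capacity - ? WHERE host_id = ?")) {
                s.setLong(1, bytes);
                s.setLong(2, storeId);
                s.executeUpdate();
            }
        }
    }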

> 
> >
> >>
> >> We would also likely want to consider a TTL to purge files after a
> >> configurable period of inactivity as a backstop against crashed
> >> processes failing to properly decrementing the reference count.  In
> >> this model, we will either defer or reject work if resources are not
> available, and we properly bound resources.
> >
> > Yes, it should be taken into consideration for all the time consuming
> operations.
> >
> >>
> >> Finally, in terms of decoupling the decision to use of this mechanism
> >> by hypervisor plugins from the storage subsystem, I think we should
> >> expose methods on the secondary storage services that allow clients
> >> to explicitly request or create resources using files (i.e.
> >> java.io.File) instead of streams (e.g. createXXX(File) or
> >> readXXXAsFile).  These interfaces would provide the storage subsystem
> with the hint that the client requires file access to the
> >> request resource.   For object store plugins, this hint would be used to
> wrap
> >> the resource in an object that would transfer in and/out of the staging
> area.
> >>
> >> Thoughts?
> >> -John
> >>


Re: [DISCUSS] NFS cache storage issue on object_store

Posted by John Burwell <jb...@basho.com>.
Chiradeep,

Looks like I made a rookie mistake in the S3-backed Secondary Storage code.  I will investigate and send a patch that does the locking on the management server side.

Thanks,
-John

On Jun 10, 2013, at 1:30 PM, Chiradeep Vittal <Ch...@citrix.com> wrote:

> 
>> 
>> From a code perspective, I think it would behove us to implement a more
>> functional approach to command execution in order to ensure reference
>> counting, error handling, resource management are handled in a consistent
>> manner.  I implemented such an approach in
>> com.cloud.utils.db.GlobalLock#executeWithLock where locking around a
>> particular operation is managed separately form the actual operation
>> being performed.
> 
> This one confused me: this is called from NfsSecondaryStorageResource
> which does not have access to the management server database. Since if you
> call the lock() function, it will try to access a db row.
> 


Re: [DISCUSS] NFS cache storage issue on object_store

Posted by Chiradeep Vittal <Ch...@citrix.com>.
>
>From a code perspective, I think it would behove us to implement a more
>functional approach to command execution in order to ensure reference
>counting, error handling, resource management are handled in a consistent
>manner.  I implemented such an approach in
>com.cloud.utils.db.GlobalLock#executeWithLock where locking around a
>particular operation is managed separately form the actual operation
>being performed.

This one confused me: this is called from NfsSecondaryStorageResource, which does not have access to the management server database, yet if you call the lock() function, it will try to access a db row.


Re: [DISCUSS] NFS cache storage issue on object_store

Posted by John Burwell <jb...@basho.com>.
Edison,

Please see my comments in-line below.

Thanks,
-John

On Jun 5, 2013, at 6:55 PM, Edison Su <Ed...@citrix.com> wrote:

> 
> 
>> -----Original Message-----
>> From: John Burwell [mailto:jburwell@basho.com]
>> Sent: Wednesday, June 05, 2013 1:04 PM
>> To: dev@cloudstack.apache.org
>> Subject: Re: [DISCUSS] NFS cache storage issue on object_store
>> 
>> Edison,
>> 
>> You have provided some great information below which helps greatly to
>> understand the role of the "NFS cache" mechanism.  To summarize, this
>> mechanism is only currently required for Xen snapshot operations driven by
>> Xen's coalescing operations.  Is my understanding correct?  Just out of
> 
> I think Ceph may still need "NFS cache", for example, during delta snapshot backup:
> http://ceph.com/dev-notes/incremental-snapshots-with-rbd/
> You need to create a delta snapshot into a file, then upload the file into S3.
> 
> For KVM, if the snapshot is taken on qcow2, then need to copy the snapshot into a file system, then backup it to S3.
> 
> Another usage case for "NFS cache " is to cache template stored on S3, if there is no zone-wide primary storage. We need to download template from S3 into every primary storage, if there is no cache, each download will take a while: comparing download template directly from S3(if the S3 is region wide) with download from a zone wide "cache" storage, I would say, the download from zone wide cache storage should be faster than from region wide S3. If there is no zone wide primary storage, then we will download the template from S3 several times, which is quite time consuming.
> 
> 
> There may have other places to use "NFS cache", but the point is as long as mgt server can be decoupled from this "cache" storage, then we can 
> decide when/how to use cache storage based on different kind of hypervisor/storage combinations in the future.

I think we would do well to re-orient the way we think about roles and requirements.  Ceph doesn't need a file system to perform a delta snapshot operation.  Xen, KVM, and/or VMWare need access to a file system to perform these operations.  The hypervisor plugin should request a reservation of x size as a file handle from the Storage subsystem.  The Ceph driver implements this request by using a staging area + transfer operation.  This approach encapsulates the operation/rules around the staging area away from clients, protects against concurrent requests flooding a resource, and allows hypervisor-specific behavior/rules to be encapsulated in the appropriate plugin.

> 
>> curiosity, is their a Xen expert on the list who can provide a high-level
>> description of the coalescing operation -- in particular, the way it interacts
>> with storage?  I have Googled a bit, and found very little information about it.
>> Has the object_store branch been tested with VMWare and KVM?  If so,
>> what operations on these hypervisors have been tested?
> 
> Both vmware and KVM is tested, but without S3 support. Haven't have time to take a look at how to use S3 in both hypervisors yet. 
> For example, we should take a look at how to import a template from url into vmware data store, thus, we can eliminate "NFS cache" during template import.

Given the release extension and the impact of these tests on the implementation, we need to test S3 with VMWare and KVM pre-merge.

> 
>> 
>> In reading through the description below, my operation concerns remain
>> regarding potential race conditions and resource exhaustion.  Also, in reading
>> through the description, I think we should find a new name for this
>> mechanism.  As Chip has previous mentioned, a cache implies the following
>> characteristics:
>> 
>>    1. Optional: Systems can operate without caches just more slowly.
>> However, with this mechanism, snapshots on Xen will not function.
> 
> 
> I agree on this one.
> 
>>    2. Volatility: Caches are backed by durable, non-volitale storage.  Therefore,
>> if the cache's data is lost, it can be rebuilt from the backing store and no data
>> will be permanently lost from the system.  However, this mechanism
>> contains snapshots in-transit to an object store.  If the data contained in this
>> "cache" were lost before its transfer to the object store completed, the
>> snapshot data would be lost.
> 
> It's the same thing for file cache on Linux file system. If the file cache is not flushed into disk, while the machine lost power, then the data on the file cache is lost.
> When we backup the snapshot from primary storage to S3, the snapshot is copied to "Nfs cache", then immediately, copied from "Nfs cache" into S3. If the snapshot on "Nfs cache" is lost, then the snapshot backup is failed. User can issue another backup snapshot command in this case. 
> So I don't think it's an issue.

The window of opportunity for data loss from a file system sync is much narrower for the Linux filesystem than for this staging area.  Furthermore, that risk can be largely (if not completely) mitigated with battery-backed hardware and/or conservative NFS settings.

For this staging area, the object store may be unreachable for an extended period of time (minutes, hours).  There are no cache flush settings or hardware solutions when it becomes unavailable.  If the data is lost from the staging area, it will be gone.  I think it is one of the largest issues with this approach, and we must be careful to ensure that data can not be lost before it is transferred out.

> 
>> 
>> In order to set expectations with users and better frame our design
>> conversation, I think it would be appropriate this mechanism as a staging,
> 
> Ok, seems cache is confusing people, we can use other term, or document it clearly, what's the role of the storage.
> Yes, it's just a temporary file system, which can be used to store some temporary files.
> 
>> scratch, or temporary area.  I also recommend removing the notion of NFS its
>> name as NFS is initial implementation of this mechanism.  In the future, I can
>> see a desire for local filesystem, RBD, and iSCSI implementations of it.
> 
> Agree, any storage can be used as "Cache" storage. If you take a look at storagemanagerImpl->createCacheStore, it's nothing related to NFS.
> 
>> 
>> In terms of solving the potential race conditions and resource exhaustion
>> issues, I don't think an LRU approach will be sufficient because the least
>> recently used resource may be still be in use by the system.  I think we
>> should look to a reservation model with reference counting where files are
>> deleted when once no processes are accessing them.  The following is a
>> (handwave-handwave) overview of the process I think would meet these
>> requirements:
>> 
>> 	1. Request a reservation for the maximum size of the file(s) that will
>> be processed in the staging area.
>> 		- If the file is already in the staging area, increase its
>> reference count
>> 		- If the reservation can not be fulfilled, we can either drop
>> the process in a retry queue or reject it.
>> 	2. Perform work and transfer file(s) to/from the object store
>> 	3. Release the file(s) -- decrementing the reference count.  When
>> the reference count is <= 0, delete the file(s) from the staging area
> 
> I assume the reference count is stored in memory and inside SSVM?
> The reference count may not work properly, in case of multiple secondary storage VMs and multiple mgt servers. And there may have a lot of places other than SSVM can directly use the cached object.
> If we store the reference count on file system, then need to take a lock(such as nfs lock, or lock file)to update, while the lock can be failed to release, due to all kind of reasons(such as network).

We could implement reference counting in a number of ways.  The first would be to increment a value in the database before command submission to the SSVM, and decrement it as part of answer processing.  We could evaluate using a distributed framework such as Hazelcast (http://www.hazelcast.com) which provides a distributed countdown latch (http://www.hazelcast.com/docs/1.9.4/javadoc/com/hazelcast/core/ICountDownLatch.html) across the SSVMs.  We need to avoid POSIX-style file system locks because they are not consistently implemented/available (e.g. OCFS2).

My first brush thoughts on it would be to use a database table in 4.2, and evaluate adopting something like Hazelcast in 4.3.  Personally, I would like to see us move away from relying on relational database semantics to implement distributed data structures (counters, locks, etc).  However, given the time pressures, I don't think we have the time to properly evaluate the impact of adopting a more general purpose distributed framework in 4.2.

From a code perspective, I think it would behove us to implement a more functional approach to command execution in order to ensure reference counting, error handling, and resource management are handled in a consistent manner.  I implemented such an approach in com.cloud.utils.db.GlobalLock#executeWithLock where locking around a particular operation is managed separately from the actual operation being performed.

> 
> I thought about it yesterday, about how to implement LRU. Originally, I though, we could eliminate race condition and track who is using objects stored on cache storage by using state machine
> For example, whenever mgt server wants to use the cached object, mgt server can change the state for the cached object to "Copying"(there is a DB entry for each cached object), after the copy is finished, then change the state into "Ready", and also update "updated" column. It will eliminate the race condition, as only one thread can access the cached object, and change its state. But the problem of this way, is that, there are cases that multiple reader threads may want to read the cached object at the same time: e.g. copy the same cached template to multiple primary storages at the same time.
> 
> In order to accommodate multiple readers, I am trying to add a new db table to track the users of  the cached object.
> The follow will be like the following:
> 1. mgt server wants to use the cached object, first, need to check the state of the cached object, the state must be in ready state.
> 2. mgt server writes a db entry into DB, the entry will contain, the id of cached object, the id of cached storage, the issued time. The db entry will also contain a state: the state can be initial/processing/finished/failed. Mgt server needs to set the state as "processing".
> 3. mgt server finishes the operation related the cached object, then mark state of above db entry as "finished",  also update the time column of above entry.
> 4. the above db entries will be removed if the state is not in "processing" for a while(let's say one week?), or if the entry is in the "processing" state for a while(let's say one day). In this way, mgt server can easily know which cached object is used or not used recently, by take a look this db table.
> 5. If mgt server find a cached object is not used(there is no db entry in the above table) for a while(let's say one week), then change the state of the cached object into "destroying", then send command to ssvm to destroy the object.
> 6. There is small window, that mgt server is changing the state of cached object into "destroying"(there is no db entry is in "processing" state in the above table,), while another thread is trying to copying(as the cached object state is still in ready state), both DB operations will success, we can hold a DB lock on the cached object entry, before both DB opeations.
> 
> How do you think?

The issue remains that the least recently used (really, least recently accessed) object can still be in use by a running process.  One example that pops to mind is a popular, large template that has a set of longish running processes creating from it.  As I described above, I think you can change the issued time to a reference count, and add logic to step 3 to decrement/check the reference count.  With the proper transaction semantics, we can provide sufficient consistency guarantees around a reference count.

The other part that we must accommodate is resource reservation.  Clients need to declare the anticipated size of their use before starting an operation.  The storage subsystem needs to track the amount of space committed vs. used, and fail fast when it is clear that the system will not have the resources available to fulfill a request.  For 4.2, I don't think we have the time to implement a robust queueing/best-efforts facility.  For 4.2, I think a checked exception indicating temporary resource unavailability will be sufficient for clients to determine the best course of recovery action (i.e. error out or retry).

> 
>> 
>> We would also likely want to consider a TTL to purge files after a configurable
>> period of inactivity as a backstop against crashed processes failing to properly
>> decrementing the reference count.  In this model, we will either defer or
>> reject work if resources are not available, and we properly bound resources.
> 
> Yes, it should be taken into consideration for all the time consuming operations.
> 
>> 
>> Finally, in terms of decoupling the decision to use of this mechanism by
>> hypervisor plugins from the storage subsystem, I think we should expose
>> methods on the secondary storage services that allow clients to explicitly
>> request or create resources using files (i.e. java.io.File) instead of streams
>> (e.g. createXXX(File) or readXXXAsFile).  These interfaces would provide the
>> storage subsystem with the hint that the client requires file access to the
>> request resource.   For object store plugins, this hint would be used to wrap
>> the resource in an object that would transfer in and/out of the staging area.
>> 
>> Thoughts?
>> -John
>> 
>> On Jun 3, 2013, at 7:17 PM, Edison Su <Ed...@citrix.com> wrote:
>> 
>>> Let's start a new thread about NFS cache storage issues on object_store.
>>> First, I'll go through how NFS storage works on master branch, then how it
>> works on object_store branch, then let's talk about the "issues".
>>> 
>>> 0.       Why we need NFS secondary storage? Nfs secondary storage is used
>> as a place to store templates/snapshots etc, it's zone wide, and it's widely
>> supported by most of hypervisors(except HyperV). NFS storage exists in
>> CloudStack since 1.x. With the rising of object storage, like S3/Swift,
>> CloudStack adds the support of Swift in 3.x, and S3 in 4.0. You may wonder, if
>> S3/Swift is used as the place to store templates/snapshots, then why we still
>> need NFS secondary storage?
>>> 
>>> There are two reasons for that:
>>> 
>>> a.       CloudStack storage code is tightly coupled with NFS secondary storage,
>> so when adding Swift/S3 support, it's likely to take shortcut, leave NFS
>> secondary storage as it is.
>>> 
>>> b.      Certain hypervisors, and certain storage related operations, can not
>> directly operate on object storage.
>>> Examples:
>>> 
>>> b.1 When backing up snapshot(the snapshot taken from xenserver
>>> hypervisor) from primary storage to S3 in xenserver
>>> 
>>> If there are snapshot chains on the volume, and if we want to coalesce the
>> snapshot chains into a new disk, then copy it to S3, we either, coalesce the
>> snapshot chains on primary storage, or on an extra storage repository (SR)
>> that supported by Xenserver.
>>> 
>>> If we coalesce it on primary storage, then may blow up the primary storage,
>> as the coalesced new disk may need a lot of space(thinking about, the new
>> disk will contain all the content in from leaf snapshot, all the way up to base
>> template), but the primary storage is not planned to this
>> operation(cloudstack mgt server is unaware of this operation, the mgt server
>> may think the primary storage still has enough space to create volumes).
>>> 
>>> While xenserver doesn't have API to coalesce snapshots directly to S3, so
>> we have to use other storages that supported by Xenserver, that's why the
>> NFS storage is used during snapshot backup. So what we did is that first call
>> xenserver api to coalesce the snapshot to NFS storage, then copy the newly
>> created file into S3. This is what we did on both master branch and
>> object_store branch.
>>>                              b.2 When create volume from snapshot if the snapshot is
>> stored on S3.
>>>                                                If the snapshot is a delta snapshot, we need to
>> coalesce them into a new volume. We can't coalesce snapshots directly on S3,
>> AFAIK, so we have to download the snapshot and its parents into
>> somewhere, then coalesce them with xenserver's tools. Again, there are two
>> options, one is to download all the snapshots into primary storage, or
>> download them into NFS storage:
>>>                                               If we download all the snapshots into primary
>> storage directly from S3, then first we need find a way import snapshot from
>> S3 into Primary storage(if primary storage is a block device, then need extra
>> care) and then coalesce them. If we go this way, need to find a primary
>> storage with enough space, and even worse, if the primary storage is not
>> zone-wide, then later on, we may need to copy the volume from one
>> primary storage to another, which is time consuming.
>>>                                               If we download all the snapshots into NFS storage
>> from S3, then coalesce them, and then copy the volume to primary storage.
>> As the NFS storage is zone wide, so, you can copy the volume into whatever
>> primary storage, without extra copy. This is what we did in master branch and
>> object_store branch.
>>>                             b.3, some hypervisors, or some storages do not support
>> directly import template into primary storage from a URL. For example, if
>> Ceph is used as primary storage, when import a template into RBD, need
>> transform a Qcow2 image into RAW disk, then into RBD format 2. In order to
>> transform an image from Qcow2 image into RAW disk, you need extra file
>> system, either a local file system(this is what other stack does, which is not
>> scalable to me), or a NFS storage(this is what can be done on both master
>> and object_store). Or one can modify hypervisor or storage to support
>> directly import template from S3 into RBD. Here is the link(http://www.mail-
>> archive.com/ceph-devel@vger.kernel.org/msg14411.html), that Wido
>> posted.
>>>                Anyway, there are so many combination of hypervisors and
>> storages: for some hypervisors with zone wide file system based storage(e.g.
>> KVM + gluster/NFS as primary storage), you don't need extra nfs storage.
>> Also if you are using VMware or HyperV, which can import template from a
>> URL, regardless which storage your are using, then you don't need extra NFS
>> storage. While if you are using xenserver, in order to create volume from
>> delta snapshot, you will need a NFS storage, or if you are using KVM + Ceph,
>> you also may need a NFS storage.
>>>               Due to above reasons, NFS cache storage is need in certain cases if
>> S3 is used as secondary storage. The combination of hypervisors and storages
>> are quite complicated, to use cache storage or not, should be case by case.
>> But as long as cloudstack provides a framework, gives people the choice to
>> enable/disable cache storage on their own, then I think the framework is
>> good enough.
>>> 
>>> 
>>> 1.       Then let's talk about how NFS storage works on master branch, with
>> or without S3.
>>> If S3 is not used, here is the how NFS storage is used:
>>> 
>>> 1.1   Register a template/ISO: cloudstack downloads the template/ISO into
>> NFS storage.
>>> 
>>> 1.2   Backup snapshot: cloudstack sends a command to xenserver
>> hypervisor, issue vdi.copy command copy the snapshot to NFS, for kvm,
>> directly use "cp" or "qemu-img convert" to copy the snapshot into NFS
>> storage.
>>> 
>>> 1.3   Create volume from snapshot: If the snapshot is a delta snapshot,
>> coalesce them on NFS storage, then vdi.copy it from NFS to primary storage.
>> If it's KVM, use "cp" or "qemu-img convert" to copy the snapshot from NFS
>> storage to primary storage.
>>> 
>>> 
>>>              If S3 is used:
>>> 
>>> 1.4   Register a template/ISO: download the template/ISO into NFS storage
>> first, then there is background thread, which can upload the template/ISO
>> from NFS storage into S3 regularly. The template is in Ready state, only
>> means the template is stored on NFS storage, but admin doesn't know the
>> template is stored on the S3 or not. Even worse, if there are multiple zones,
>> cloudstack will copy the template from one zone wide NFS storage into
>> another NFS storage in another zone, while there is already has a region wide
>> S3 available. As the template is not directly uploaded to S3 when registering a
>> template, it will take several copy in order to spread the template into a
>> region wide.
>>> 
>>> 1.5   Backup snapshot: cloudstack sends a command to xenserver
>> hypervisor, copy the snapshot to NFS storage, then immediately, upload the
>> snapshot from NFS storage into S3. The snapshot is in Backedup state, not
>> only means the snapshot is in  NFS storage, but also means it's stored on S3.
>>> 
>>> 1.6   Create volume from snapshot: download the snapshot  and it's parent
>> snapshots from S3 into NFS storage, then coalesce and vdi.copy the volume
>> from NFS to primary storage.
>>> 
>>> 
>>> 
>>> 2.       Then let's talk about how it works on object_store:
>>> If S3 is not used, there is ZERO change from master branch. How the NFS
>> secondary storage works before, is the same on object_store.
>>> If S3 is used, and NFS cache storage used also(which is by default):
>>>  2.1 Register a template/ISO: the template/ISO are directly uploaded to S3,
>> there is no extra copy to NFS storage. When the template is in "Ready" state,
>> means the template is stored on S3.                  It implies that: the template is
>> immediately available in the region as soon as it's in Ready State. And admin
>> can clearly knows the status of template on S3, what's percentage of the
>> uploading, is it failed or succeed? Also if register template failed for some
>> reason, admin can issue the register template command again. I would say
>> the change of how to register template into S3 is far better than what we did
>> on master branch.
>>>  2.2 Backup snapshot: it's same as master branch, sends a command to
>> xenserver host, copy the snapshot into NFS, then upload to S3.
>>>  2.3 Create volume from snapshot: it's the same as master branch,
>> download snapshot and it's parent snaphots from S3 into NFS, then copy it
>> from NFS to primary storage.
>>> From above few typical usage cases, you may understand how S3 and NFS
>> cache storage is used, and what's difference between object_store branch
>> and master branch: basically, we only change the way how to register a
>> template, nothing else.
>>> If S3 is used, and no NFS cache storage is used(it's possible, depends on
>> which datamotion strategy is used):
>>>   2.4 Register a template/ISO: it's the same as 2.1
>>>   2.5 Backup snapshot: export the snapshot from primary storage into S3
>> directly
>>>   2.6 Create volume from snapshot: download snapshots from S3 into
>> primary storage directly, then coalesce and create volume from it.
>>> 
>>>         Hope above explanation will tell the truth how the system works on
>> object_store, and clarify the misconception/misunderstanding  about
>> object_store branch. Even the change is huge, we still maintain the back
>> compatibility. If you don't want to use S3, only want to existing NFS storage,
>> it's definitely OK, it works the same as before. If you want to use S3, we
>> provide a better S3 implementation when registering template/ISO. If you
>> want to use S3 without NFS storage, that's also definitely OK,  the framework
>> is quite flexible to accommodate different solutions.
>>> 
>>> Ok, let's talk  about the NFS storage cache issues.
>>> The issue about NFS cache storage is discussed in several threads, back and
>> forth. All in all, the NFs cache storage is only one usage case out of three
>> usage cases supported by object_store branch. It's not something that if it
>> has issue, then everything doesn't work.
>>> In above 2.2 and 2.3, it shows how the NFS cache storage is involved during
>> snapshot related operations. The complains about there is no aging policy, no
>> capacity planner for NFS cache storage, is happened when download a
>> snapshot from S3 into NFS, or copy a snapshot from primary storage into NFS,
>> or download template from S3 into NFS. Yes, it's an issue, the NFS cache
>> storage can be used out, if there is no capacity planner, and no aging out
>> policy. But can it be fixed? Is it a design issue?
>>> Let's look at the code. There is not much code related to NFS cache
>> storage; only one class depends on it:
>>> https://git-wip-us.apache.org/repos/asf?p=cloudstack.git;a=blob;f=engine/storage/datamotion/src/org/apache/cloudstack/storage/motion/AncientDataMotionStrategy.java;h=a01d2d30139f70ad8c907b6d6bc9759d47dcc2d6;hb=refs/heads/object_store
>>> Take copyVolumeFromSnapshot as an example, which is called when creating a
>> volume from a snapshot. It first calls cacheSnapshotChain, which calls
>> cacheMgr.createCacheObject to download the snapshot into the NFS cache
>> storage. StorageCacheManagerImpl->createCacheObject is the only place that
>> creates objects on the NFS cache storage; the code is at
>>> https://git-wip-us.apache.org/repos/asf?p=cloudstack.git;a=blob;f=engine/storage/cache/src/org/apache/cloudstack/storage/cache/manager/StorageCacheManagerImpl.java;h=cb5ea106fed3e5d2135dca7d98aede13effcf7d9;hb=refs/heads/object_store
>>> In createCacheObject, it first finds a cache storage, in case there are
>> multiple cache storages available in a scope:
>>> DataStore cacheStore = this.getCacheStorage(scope);
>>> getCacheStorage calls StorageCacheAllocator to find a proper NFS cache
>> storage. So StorageCacheAllocator is the place where a NFS cache storage is
>> chosen based on certain criteria; the current implementation just picks one
>> of them at random, and we can add a new allocator algorithm based on
>> capacity, etc.
>>> Regarding capacity reservation, there is already a table called
>> op_host_capacity which has an entry for NFS secondary storage; we can reuse
>> that entry to store capacity information about NFS cache storages (total
>> size, available/used capacity, etc.). Then, on every createCacheObject call,
>> we can ask StorageCacheAllocator to find a proper NFS storage based on a
>> first-fit criterion and increase the used capacity in the op_host_capacity
>> table. If creating the cache object fails, the capacity is returned to
>> op_host_capacity.
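>>> For example (just a sketch to illustrate the idea; the class and field
>>> names below are made up, they are not the actual object_store code), the
>>> first-fit selection plus capacity reservation could look like:
>>>
>>> import java.util.List;
>>> import java.util.Optional;
>>>
>>> // Illustrative only -- stands in for StorageCacheAllocator plus the
>>> // op_host_capacity bookkeeping described above.
>>> class CacheStore {
>>>     final long id;
>>>     final long totalBytes;
>>>     long usedBytes;          // mirrors the op_host_capacity row for this store
>>>     CacheStore(long id, long totalBytes, long usedBytes) {
>>>         this.id = id; this.totalBytes = totalBytes; this.usedBytes = usedBytes;
>>>     }
>>>     long freeBytes() { return totalBytes - usedBytes; }
>>> }
>>>
>>> class CapacityFirstFitAllocator {
>>>     /** First fit: pick the first cache store that can hold the object. */
>>>     Optional<CacheStore> allocate(List<CacheStore> stores, long requiredBytes) {
>>>         return stores.stream()
>>>                 .filter(s -> s.freeBytes() >= requiredBytes)
>>>                 .findFirst();
>>>     }
>>>
>>>     /** Reserve capacity before creating the cache object; roll back on failure. */
>>>     void createCacheObject(List<CacheStore> stores, long sizeBytes) {
>>>         CacheStore store = allocate(stores, sizeBytes)
>>>                 .orElseThrow(() -> new IllegalStateException("no cache store has enough space"));
>>>         store.usedBytes += sizeBytes;          // increase used capacity in op_host_capacity
>>>         try {
>>>             copyIntoCacheStore(store, sizeBytes);  // download from S3 / copy from primary
>>>         } catch (RuntimeException e) {
>>>             store.usedBytes -= sizeBytes;      // return the capacity if the copy failed
>>>             throw e;
>>>         }
>>>     }
>>>
>>>     private void copyIntoCacheStore(CacheStore store, long sizeBytes) { /* ... */ }
>>> }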
>>> 
>>> Regarding the aging-out policy, we can start a background thread on the
>> mgt server which scans all the objects created on NFS cache storage (the
>> tables are snapshot_store_ref, template_store_ref and volume_store_ref).
>> Each entry in these tables has a column called "updated"; every time the
>> object's state changes, the "updated" column is updated as well. When does
>> the object's state change? Every time the object is used in some context
>> (such as copying a snapshot on the NFS cache storage to somewhere else), its
>> state changes accordingly, e.g. to "Copying", meaning the object is being
>> copied to some place, which is exactly the information we need to implement
>> an LRU algorithm.
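>>> A rough sketch of such a background sweeper (again, the names are
>>> illustrative, not the real snapshot/template/volume_store_ref DAOs):
>>>
>>> import java.time.Duration;
>>> import java.time.Instant;
>>> import java.util.List;
>>>
>>> // Uses the "updated" column as the LRU timestamp for cached objects.
>>> class CachedObjectRef {
>>>     long id;
>>>     String state;       // e.g. "Ready", "Copying", "Destroying"
>>>     Instant updated;    // bumped on every state change, i.e. on every use
>>> }
>>>
>>> class CacheSweeper {
>>>     private final Duration maxIdle = Duration.ofDays(7);
>>>
>>>     /** Runs periodically on the mgt server. */
>>>     void sweep(List<CachedObjectRef> cacheEntries, Instant now) {
>>>         for (CachedObjectRef ref : cacheEntries) {
>>>             boolean idleTooLong = ref.updated.plus(maxIdle).isBefore(now);
>>>             // Only Ready objects are candidates; "Copying" means the object is in use.
>>>             if ("Ready".equals(ref.state) && idleTooLong) {
>>>                 ref.state = "Destroying";
>>>                 deleteFromCacheStorage(ref);   // send a delete command to the SSVM
>>>             }
>>>         }
>>>     }
>>>
>>>     private void deleteFromCacheStorage(CachedObjectRef ref) { /* ... */ }
>>> }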
>>> 
>>> What do you guys think about the fix? If you have a better solution, please
>> let me know.
>>> 
>>> 
> 


RE: [DISCUSS] NFS cache storage issue on object_store

Posted by Edison Su <Ed...@citrix.com>.

> -----Original Message-----
> From: John Burwell [mailto:jburwell@basho.com]
> Sent: Wednesday, June 05, 2013 1:04 PM
> To: dev@cloudstack.apache.org
> Subject: Re: [DISCUSS] NFS cache storage issue on object_store
> 
> Edison,
> 
> You have provided some great information below which helps greatly to
> understand the role of the "NFS cache" mechanism.  To summarize, this
> mechanism is only currently required for Xen snapshot operations driven by
> Xen's coalescing operations.  Is my understanding correct?  Just out of

I think Ceph may still need the "NFS cache", for example during delta snapshot backup:
http://ceph.com/dev-notes/incremental-snapshots-with-rbd/
You need to export the delta snapshot into a file, then upload that file to S3.

For KVM, if the snapshot is taken on qcow2, we need to copy the snapshot into a file system first, then back it up to S3.

Another use case for the "NFS cache" is caching templates stored on S3 when there is no zone-wide primary storage. Without a cache, we have to download the template from S3 into every primary storage, and each download takes a while: comparing a download directly from S3 (which is region wide) with a download from a zone-wide "cache" storage, the download from the zone-wide cache should be faster. If there is no zone-wide primary storage, we would otherwise download the same template from S3 several times, which is quite time consuming.


There may be other places where the "NFS cache" is useful, but the point is that as long as the mgt server is decoupled from this "cache" storage, we can decide when and how to use cache storage for different hypervisor/storage combinations in the future.

> curiosity, is there a Xen expert on the list who can provide a high-level
> description of the coalescing operation -- in particular, the way it interacts
> with storage?  I have Googled a bit, and found very little information about it.
> Has the object_store branch been tested with VMWare and KVM?  If so,
> what operations on these hypervisors have been tested?

Both VMware and KVM are tested, but without S3 support. We haven't had time to look at how to use S3 with those hypervisors yet.
For example, we should look at how to import a template from a URL into a VMware datastore, so that we can eliminate the "NFS cache" during template import.

> 
> In reading through the description below, my operational concerns remain
> regarding potential race conditions and resource exhaustion.  Also, in reading
> through the description, I think we should find a new name for this
> mechanism.  As Chip has previously mentioned, a cache implies the following
> characteristics:
> 
>     1. Optional: Systems can operate without caches just more slowly.
> However, with this mechanism, snapshots on Xen will not function.


I agree on this one.

>     2. Volatility: Caches are backed by durable, non-volatile storage.  Therefore,
> if the cache's data is lost, it can be rebuilt from the backing store and no data
> will be permanently lost from the system.  However, this mechanism
> contains snapshots in-transit to an object store.  If the data contained in this
> "cache" were lost before its transfer to the object store completed, the
> snapshot data would be lost.

It's the same as the file cache on a Linux file system: if the cache is not flushed to disk and the machine loses power, the data in the file cache is lost.
When we back up a snapshot from primary storage to S3, the snapshot is copied to the "NFS cache" and then immediately copied from the "NFS cache" into S3. If the snapshot on the "NFS cache" is lost, the snapshot backup fails, and the user can issue another backup snapshot command in that case.
So I don't think it's an issue.

> 
> In order to set expectations with users and better frame our design
> conversation, I think it would be appropriate to describe this mechanism as a staging,

OK, it seems "cache" is confusing people; we can use another term, or clearly document the role of this storage.
Yes, it's just a temporary file system that can be used to store temporary files.

> scratch, or temporary area.  I also recommend removing the notion of NFS from its
> name, as NFS is just the initial implementation of this mechanism.  In the future, I can
> see a desire for local filesystem, RBD, and iSCSI implementations of it.

Agreed, any storage can be used as the "cache" storage. If you take a look at storagemanagerImpl->createCacheStore, there is nothing NFS-specific in it.

> 
> In terms of solving the potential race conditions and resource exhaustion
> issues, I don't think an LRU approach will be sufficient because the least
> recently used resource may still be in use by the system.  I think we
> should look to a reservation model with reference counting where files are
> deleted once no processes are accessing them.  The following is a
> (handwave-handwave) overview of the process I think would meet these
> requirements:
> 
> 	1. Request a reservation for the maximum size of the file(s) that will
> be processed in the staging area.
> 		- If the file is already in the staging area, increase its
> reference count
> 		- If the reservation can not be fulfilled, we can either drop
> the process in a retry queue or reject it.
> 	2. Perform work and transfer file(s) to/from the object store
> 	3. Release the file(s) -- decrementing the reference count.  When
> the reference count is <= 0, delete the file(s) from the staging area

I assume the reference count is stored in memory, inside the SSVM?
Reference counting may not work properly when there are multiple secondary storage VMs and multiple mgt servers, and there may be many places other than the SSVM that use the cached object directly.
If we store the reference count on the file system instead, we need to take a lock (such as an NFS lock, or a lock file) to update it, and the lock can fail to be released for all kinds of reasons (such as network problems).

I thought yesterday about how to implement LRU. Originally, I thought we could eliminate the race condition and track who is using objects stored on the cache storage with a state machine.
For example, whenever the mgt server wants to use a cached object, it changes the state of the cached object to "Copying" (there is a DB entry for each cached object); after the copy is finished, it changes the state back to "Ready" and also updates the "updated" column. That eliminates the race condition, as only one thread can access the cached object and change its state. The problem with this approach is that there are cases where multiple reader threads want to read the cached object at the same time, e.g. copying the same cached template to multiple primary storages concurrently.

In order to accommodate multiple readers, I am trying to add a new DB table to track the users of the cached object.
The flow would be like the following (a rough sketch in code follows the list):
1. When the mgt server wants to use a cached object, it first checks the state of the cached object; the state must be Ready.
2. The mgt server writes an entry into the new DB table. The entry contains the id of the cached object, the id of the cache storage, and the time it was issued. The entry also has a state, which can be initial/processing/finished/failed; the mgt server sets it to "processing".
3. When the mgt server finishes the operation on the cached object, it marks the state of the entry as "finished" and updates the time column of the entry.
4. Entries are removed if they are not in the "processing" state and are older than some period (say, one week), or if they have been stuck in the "processing" state for too long (say, one day). This way the mgt server can easily tell which cached objects have or have not been used recently by looking at this table.
5. If the mgt server finds that a cached object has not been used for a while (say, one week; i.e. there is no entry for it in the table), it changes the state of the cached object to "Destroying" and sends a command to the SSVM to destroy the object.
6. There is a small window in which the mgt server is changing the state of a cached object to "Destroying" (because no entry in the table is in the "processing" state) while another thread is trying to copy it (because the cached object is still in the Ready state); both DB operations would succeed. We can hold a DB lock on the cached object entry before both DB operations to close this window.
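
Roughly, in code (the DAO and class names here are made up just to show the intended flow, they are not the real CloudStack classes):

import java.time.Duration;
import java.time.Instant;
import java.util.List;

// Illustrative sketch of the usage-tracking table flow.
class CacheUsageRecord {
    long cachedObjectId;
    long cacheStoreId;
    String state;          // "initial" / "processing" / "finished" / "failed"
    Instant issued;
}

interface UsageDao {       // stands in for the new DB table
    CacheUsageRecord insertProcessing(long cachedObjectId, long cacheStoreId);
    void markFinished(CacheUsageRecord rec);
    void markFailed(CacheUsageRecord rec);
    boolean hasRecentUsage(long cachedObjectId, Duration window);
    List<CacheUsageRecord> staleRecords(Duration window);   // step 4: purge these
}

class CachedObjectTracker {
    private final UsageDao usageDao;
    CachedObjectTracker(UsageDao usageDao) { this.usageDao = usageDao; }

    /** Steps 1-3: mgt server uses a cached object that is in Ready state. */
    void useCachedObject(long objectId, long storeId, Runnable copyOperation) {
        CacheUsageRecord rec = usageDao.insertProcessing(objectId, storeId); // step 2
        try {
            copyOperation.run();           // e.g. copy the cached template to a primary storage
            usageDao.markFinished(rec);    // step 3
        } catch (RuntimeException e) {
            usageDao.markFailed(rec);
            throw e;
        }
    }

    /** Steps 5-6: background cleanup on the mgt server. */
    void ageOut(long objectId) {
        // Step 6: take a DB lock on the cached object row before deciding, so a
        // concurrent reader cannot slip in between the check and the destroy.
        synchronized (this) {  // stands in for SELECT ... FOR UPDATE on the object row
            if (!usageDao.hasRecentUsage(objectId, Duration.ofDays(7))) {   // step 5
                // change the cached object state to "Destroying", then tell the SSVM to delete it
            }
        }
    }
}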

What do you think?
 
> 
> We would also likely want to consider a TTL to purge files after a configurable
> period of inactivity as a backstop against crashed processes failing to properly
> decrement the reference count.  In this model, we will either defer or
> reject work if resources are not available, and we properly bound resources.

Yes, it should be taken into consideration for all the time consuming operations.

> 
> Finally, in terms of decoupling the decision to use of this mechanism by
> hypervisor plugins from the storage subsystem, I think we should expose
> methods on the secondary storage services that allow clients to explicitly
> request or create resources using files (i.e. java.io.File) instead of streams
> (e.g. createXXX(File) or readXXXAsFile).  These interfaces would provide the
> storage subsystem with the hint that the client requires file access to the
> requested resource.   For object store plugins, this hint would be used to wrap
> the resource in an object that would transfer in and/out of the staging area.
> 
> Thoughts?
> -John
> 


RE: [DISCUSS] NFS cache storage issue on object_store

Posted by Edison Su <Ed...@citrix.com>.

> -----Original Message-----
> From: John Burwell [mailto:jburwell@basho.com]
> Sent: Wednesday, June 05, 2013 1:06 PM
> To: dev@cloudstack.apache.org
> Subject: Re: [DISCUSS] NFS cache storage issue on object_store
> 
> Edison,
> 
> One thing I forgot to say is that reference counting may be an unnecessary
> complexity in the event that sharing of the same resource by multiple
> processes concurrently is rare.

I assume you are talking about reference counting inside the SSVM, right?
If there are multiple SSVMs, how do you coordinate the reference counting between them?

It is better and safer to do it on the mgt server, which can store this information in the DB.

Another issue with reference counting is that we should not immediately delete a cached object when its reference count drops to zero.
The cached object may be used again later. For example, we may cache a template on the cache storage when we create a VM from that template on one primary storage for the first time. Later on, the mgt server decides to create another VM on another primary storage, and we need to download the template from the cache storage into that second primary storage. If the cached template had been deleted right after it was copied into the first primary storage, we would have to download it from S3 again.

Creating a VM from a template is a frequent operation, so downloading the cached template into primary storage is also frequent when there is no zone-wide primary storage. That's why I think we'd better use LRU instead of reference counting.
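
Concretely, the eviction policy I have in mind is something like this (illustrative only):

import java.time.Instant;
import java.util.Comparator;
import java.util.List;

// Keep cached objects around after use; evict least-recently-used ones only
// when the cache storage actually needs the space.
class CachedTemplate {
    long sizeBytes;
    Instant lastUsed;
    boolean inUse;      // e.g. currently being copied to a primary storage
}

class LruEviction {
    void freeSpace(List<CachedTemplate> cached, long bytesNeeded) {
        cached.sort(Comparator.comparing((CachedTemplate t) -> t.lastUsed));  // oldest first
        long freed = 0;
        for (CachedTemplate t : cached) {
            if (freed >= bytesNeeded) break;
            if (t.inUse) continue;     // never evict an object that is being copied
            freed += t.sizeBytes;
            // delete t from the cache storage here
        }
    }
}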


> 
> Thanks,
> -John
> 


Re: [DISCUSS] NFS cache storage issue on object_store

Posted by John Burwell <jb...@basho.com>.
Edison,

One thing I forgot to say is that reference counting may be an unnecessary complexity in the event that sharing of the same resource by multiple processes concurrently is rare.

Thanks,
-John

On Jun 5, 2013, at 4:04 PM, John Burwell <jb...@basho.com> wrote:

> Edison,
> 
> You have provided some great information below which helps greatly to understand the role of the "NFS cache" mechanism.  To summarize, this mechanism is only currently required for Xen snapshot operations driven by Xen's coalescing operations.  Is my understanding correct?  Just out of curiosity, is their a Xen expert on the list who can provide a high-level description of the coalescing operation -- in particular, the way it interacts with storage?  I have Googled a bit, and found very little information about it.  Has the object_store branch been tested with VMWare and KVM?  If so, what operations on these hypervisors have been tested?
> 
> In reading through the description below, my operation concerns remain regarding potential race conditions and resource exhaustion.  Also, in reading through the description, I think we should find a new name for this mechanism.  As Chip has previous mentioned, a cache implies the following characteristics:
> 
>    1. Optional: Systems can operate without caches just more slowly.  However, with this mechanism, snapshots on Xen will not function.
>    2. Volatility: Caches are backed by durable, non-volitale storage.  Therefore, if the cache's data is lost, it can be rebuilt from the backing store and no data will be permanently lost from the system.  However, this mechanism contains snapshots in-transit to an object store.  If the data contained in this "cache" were lost before its transfer to the object store completed, the snapshot data would be lost.
> 
> In order to set expectations with users and better frame our design conversation, I think it would be appropriate this mechanism as a staging, scratch, or temporary area.  I also recommend removing the notion of NFS its name as NFS is initial implementation of this mechanism.  In the future, I can see a desire for local filesystem, RBD, and iSCSI implementations of it.
> 
> In terms of solving the potential race conditions and resource exhaustion issues, I don't think an LRU approach will be sufficient because the least recently used resource may be still be in use by the system.  I think we should look to a reservation model with reference counting where files are deleted when once no processes are accessing them.  The following is a (handwave-handwave) overview of the process I think would meet these requirements:
> 
> 	1. Request a reservation for the maximum size of the file(s) that will be processed in the staging area.
> 		- If the file is already in the staging area, increase its reference count
> 		- If the reservation can not be fulfilled, we can either drop the process in a retry queue or reject it.  
> 	2. Perform work and transfer file(s) to/from the object store
> 	3. Release the file(s) -- decrementing the reference count.  When the reference count is <= 0, delete the file(s) from the staging area
> 
> We would also likely want to consider a TTL to purge files after a configurable period of inactivity as a backstop against crashed processes failing to properly decrementing the reference count.  In this model, we will either defer or reject work if resources are not available, and we properly bound resources.  
> 
> Finally, in terms of decoupling the decision to use of this mechanism by hypervisor plugins from the storage subsystem, I think we should expose methods on the secondary storage services that allow clients to explicitly request or create resources using files (i.e. java.io.File) instead of streams (e.g. createXXX(File) or readXXXAsFile).  These interfaces would provide the storage subsystem with the hint that the client requires file access to the request resource.   For object store plugins, this hint would be used to wrap the resource in an object that would transfer in and/out of the staging area.
> 
> Thoughts?
> -John
> 
> On Jun 3, 2013, at 7:17 PM, Edison Su <Ed...@citrix.com> wrote:
> 
>> Let's start a new thread about NFS cache storage issues on object_store.
>> First, I'll go through how NFS storage works on master branch, then how it works on object_store branch, then let's talk about the "issues".
>> 
>> 0.       Why we need NFS secondary storage? Nfs secondary storage is used as a place to store templates/snapshots etc, it's zone wide, and it's widely supported by most of hypervisors(except HyperV). NFS storage exists in CloudStack since 1.x. With the rising of object storage, like S3/Swift, CloudStack adds the support of Swift in 3.x, and S3 in 4.0. You may wonder, if S3/Swift is used as the place to store templates/snapshots, then why we still need NFS secondary storage?
>> 
>> There are two reasons for that:
>> 
>> a.       CloudStack storage code is tightly coupled with NFS secondary storage, so when adding Swift/S3 support, it's likely to take shortcut, leave NFS secondary storage as it is.
>> 
>> b.      Certain hypervisors, and certain storage related operations, can not directly operate on object storage.
>> Examples:
>> 
>> b.1 When backing up snapshot(the snapshot taken from xenserver hypervisor) from primary storage to S3 in xenserver
>> 
>> If there are snapshot chains on the volume, and if we want to coalesce the snapshot chains into a new disk, then copy it to S3, we either, coalesce the snapshot chains on primary storage, or on an extra storage repository (SR) that supported by Xenserver.
>> 
>> If we coalesce it on primary storage, then may blow up the primary storage, as the coalesced new disk may need a lot of space(thinking about, the new disk will contain all the content in from leaf snapshot, all the way up to base template), but the primary storage is not planned to this operation(cloudstack mgt server is unaware of this operation, the mgt server may think the primary storage still has enough space to create volumes).
>> 
>> While xenserver doesn't have API to coalesce snapshots directly to S3, so we have to use other storages that supported by Xenserver, that's why the NFS storage is used during snapshot backup. So what we did is that first call xenserver api to coalesce the snapshot to NFS storage, then copy the newly created file into S3. This is what we did on both master branch and object_store branch.
>>                              b.2 When create volume from snapshot if the snapshot is stored on S3.
>>                                                If the snapshot is a delta snapshot, we need to coalesce them into a new volume. We can't coalesce snapshots directly on S3, AFAIK, so we have to download the snapshot and its parents into somewhere, then coalesce them with xenserver's tools. Again, there are two options, one is to download all the snapshots into primary storage, or download them into NFS storage:
>>                                               If we download all the snapshots into primary storage directly from S3, then first we need find a way import snapshot from S3 into Primary storage(if primary storage is a block device, then need extra care) and then coalesce them. If we go this way, need to find a primary storage with enough space, and even worse, if the primary storage is not zone-wide, then later on, we may need to copy the volume from one primary storage to another, which is time consuming.
>>                                               If we download all the snapshots into NFS storage from S3, then coalesce them, and then copy the volume to primary storage. As the NFS storage is zone wide, so, you can copy the volume into whatever primary storage, without extra copy. This is what we did in master branch and object_store branch.
>>                             b.3, some hypervisors, or some storages do not support directly import template into primary storage from a URL. For example, if Ceph is used as primary storage, when import a template into RBD, need transform a Qcow2 image into RAW disk, then into RBD format 2. In order to transform an image from Qcow2 image into RAW disk, you need extra file system, either a local file system(this is what other stack does, which is not scalable to me), or a NFS storage(this is what can be done on both master and object_store). Or one can modify hypervisor or storage to support directly import template from S3 into RBD. Here is the link(http://www.mail-archive.com/ceph-devel@vger.kernel.org/msg14411.html), that Wido posted.
>>                Anyway, there are so many combination of hypervisors and storages: for some hypervisors with zone wide file system based storage(e.g. KVM + gluster/NFS as primary storage), you don't need extra nfs storage. Also if you are using VMware or HyperV, which can import template from a URL, regardless which storage your are using, then you don't need extra NFS storage. While if you are using xenserver, in order to create volume from delta snapshot, you will need a NFS storage, or if you are using KVM + Ceph, you also may need a NFS storage.
>>               Due to above reasons, NFS cache storage is need in certain cases if S3 is used as secondary storage. The combination of hypervisors and storages are quite complicated, to use cache storage or not, should be case by case. But as long as cloudstack provides a framework, gives people the choice to enable/disable cache storage on their own, then I think the framework is  good enough.
>> 
>> 
>> 1.       Then let's talk about how NFS storage works on master branch, with or without S3.
>> If S3 is not used, here is the how NFS storage is used:
>> 
>> 1.1   Register a template/ISO: cloudstack downloads the template/ISO into NFS storage.
>> 
>> 1.2   Backup snapshot: cloudstack sends a command to xenserver hypervisor, issue vdi.copy command copy the snapshot to NFS, for kvm, directly use "cp" or "qemu-img convert" to copy the snapshot into NFS storage.
>> 
>> 1.3   Create volume from snapshot: If the snapshot is a delta snapshot, coalesce them on NFS storage, then vdi.copy it from NFS to primary storage. If it's KVM, use "cp" or "qemu-img convert" to copy the snapshot from NFS storage to primary storage.
>> 
>> 
>>              If S3 is used:
>> 
>> 1.4   Register a template/ISO: download the template/ISO into NFS storage first, then there is background thread, which can upload the template/ISO from NFS storage into S3 regularly. The template is in Ready state, only means the template is stored on NFS storage, but admin doesn't know the template is stored on the S3 or not. Even worse, if there are multiple zones, cloudstack will copy the template from one zone wide NFS storage into another NFS storage in another zone, while there is already has a region wide S3 available. As the template is not directly uploaded to S3 when registering a template, it will take several copy in order to spread the template into a region wide.
>> 
>> 1.5   Backup snapshot: cloudstack sends a command to xenserver hypervisor, copy the snapshot to NFS storage, then immediately, upload the snapshot from NFS storage into S3. The snapshot is in Backedup state, not only means the snapshot is in  NFS storage, but also means it's stored on S3.
>> 
>> 1.6   Create volume from snapshot: download the snapshot  and it's parent snapshots from S3 into NFS storage, then coalesce and vdi.copy the volume from NFS to primary storage.
>> 
>> 
>> 
>> 2.       Then let's talk about how it works on object_store:
>> If S3 is not used, there is ZERO change from master branch. How the NFS secondary storage works before, is the same on object_store.
>> If S3 is used, and NFS cache storage used also(which is by default):
>>  2.1 Register a template/ISO: the template/ISO is uploaded directly to S3; there is no extra copy to NFS storage. When the template is in the "Ready" state, it means the template is stored on S3. This implies that the template is immediately available across the region as soon as it is in the Ready state, and the admin clearly knows the status of the template on S3: what percentage of the upload has completed, and whether it failed or succeeded. Also, if registering the template failed for some reason, the admin can issue the register-template command again. I would say this way of registering a template into S3 is far better than what we did on the master branch.
>>  2.2 Backup snapshot: it's the same as the master branch: send a command to the XenServer host to copy the snapshot into NFS, then upload it to S3.
>>  2.3 Create volume from snapshot: it's the same as the master branch: download the snapshot and its parent snapshots from S3 into NFS, then copy the resulting volume from NFS to primary storage.
>> From the few typical use cases above, you can see how S3 and NFS cache storage are used, and what the difference is between the object_store branch and the master branch: basically, we only changed the way a template is registered, nothing else.
>> If S3 is used and no NFS cache storage is used (which is possible, depending on which data motion strategy is used):
>>   2.4 Register a template/ISO: it's the same as 2.1.
>>   2.5 Backup snapshot: export the snapshot from primary storage into S3 directly.
>>   2.6 Create volume from snapshot: download the snapshots from S3 into primary storage directly, then coalesce them and create the volume from the result.
>> 
>>         I hope the above explanation shows how the system actually works on object_store and clears up the misconceptions/misunderstandings about the object_store branch. Even though the change is huge, we still maintain backward compatibility. If you don't want to use S3 and only want the existing NFS storage, that's definitely OK; it works the same as before. If you want to use S3, we provide a better S3 implementation for registering templates/ISOs. If you want to use S3 without NFS storage, that's also definitely OK; the framework is flexible enough to accommodate different solutions.
>> 
>> OK, let's talk about the NFS cache storage issues.
>> The NFS cache storage issue has been discussed in several threads, back and forth. All in all, NFS cache storage is only one of the three usage cases supported by the object_store branch. It's not the case that if it has an issue, then nothing works.
>> Sections 2.2 and 2.3 above show how NFS cache storage is involved in snapshot-related operations. The complaints that there is no aging policy and no capacity planner for NFS cache storage apply when downloading a snapshot from S3 into NFS, copying a snapshot from primary storage into NFS, or downloading a template from S3 into NFS. Yes, it's an issue: the NFS cache storage can fill up if there is no capacity planner and no aging-out policy. But can it be fixed? Is it a design issue?
>> Let's talk about the code. Here is the code related to NFS cache storage; there isn't much, and only one class depends on NFS cache storage: https://git-wip-us.apache.org/repos/asf?p=cloudstack.git;a=blob;f=engine/storage/datamotion/src/org/apache/cloudstack/storage/motion/AncientDataMotionStrategy.java;h=a01d2d30139f70ad8c907b6d6bc9759d47dcc2d6;hb=refs/heads/object_store
>> Take copyVolumeFromSnapshot as an example, which is called when creating a volume from a snapshot. It first calls cacheSnapshotChain, which calls cacheMgr.createCacheObject to download the snapshot into NFS cache storage. StorageCacheManagerImpl.createCacheObject is the only place where objects are created on NFS cache storage; the code is at https://git-wip-us.apache.org/repos/asf?p=cloudstack.git;a=blob;f=engine/storage/cache/src/org/apache/cloudstack/storage/cache/manager/StorageCacheManagerImpl.java;h=cb5ea106fed3e5d2135dca7d98aede13effcf7d9;hb=refs/heads/object_store
>> In createCacheObject, it first selects a cache storage, in case there are multiple cache storages available in a scope:
>> DataStore cacheStore = this.getCacheStorage(scope);
>> getCacheStorage calls StorageCacheAllocator to find a proper NFS cache storage. So StorageCacheAllocator is the place where the NFS cache storage is chosen based on certain criteria; the current implementation just picks one of them at random, but we can add a new allocator algorithm based on capacity, etc.
>> Regarding capacity reservation, there is already a table called op_host_capacity which has an entry for NFS secondary storage; we can reuse this entry to store capacity information about NFS cache storages (such as total size and available/used capacity). So on every call to createCacheObject, we can ask StorageCacheAllocator to find a proper NFS storage based on a first-fit criterion, then increase the used capacity in the op_host_capacity table. If creating the cache object fails, return the reserved capacity to op_host_capacity. A sketch combining both ideas follows below.
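>>
>> To make this concrete, here is a minimal sketch of a capacity-aware, first-fit allocator with the reservation bookkeeping described above. The DAO and its methods are invented for illustration (they do not exist in the branch), it assumes the engine's DataStore interface exposes getId(), and the real StorageCacheAllocator interface may look different; treat it as a sketch of the idea, not the implementation.
>>
>>     import java.util.List;
>>
>>     import org.apache.cloudstack.engine.subsystem.api.storage.DataStore;
>>
>>     // Hypothetical DAO over the op_host_capacity entries for cache storages.
>>     interface CacheCapacityDao {
>>         long getTotalBytes(long storeId);
>>         long getUsedBytes(long storeId);
>>         boolean tryReserve(long storeId, long bytes); // atomically bump used capacity if it still fits
>>         void release(long storeId, long bytes);       // give the reserved capacity back
>>     }
>>
>>     // First-fit, capacity-aware selection instead of the current random choice.
>>     public class FirstFitStorageCacheAllocator {
>>         private final CacheCapacityDao capacityDao;
>>
>>         public FirstFitStorageCacheAllocator(CacheCapacityDao capacityDao) {
>>             this.capacityDao = capacityDao;
>>         }
>>
>>         // Pick the first cache store in the scope with enough free space and reserve the
>>         // requested size; the caller must call release() if creating the cache object fails.
>>         public DataStore allocate(List<DataStore> cacheStores, long requiredBytes) {
>>             for (DataStore store : cacheStores) {
>>                 long free = capacityDao.getTotalBytes(store.getId()) - capacityDao.getUsedBytes(store.getId());
>>                 if (free >= requiredBytes && capacityDao.tryReserve(store.getId(), requiredBytes)) {
>>                     return store;
>>                 }
>>             }
>>             return null; // nothing fits: fail the operation or retry later
>>         }
>>
>>         public void release(DataStore store, long reservedBytes) {
>>             capacityDao.release(store.getId(), reservedBytes);
>>         }
>>     }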
>> 
>> Regarding the aging-out policy, we can start a background thread on the management server which scans all the objects created on NFS cache storage (the tables snapshot_store_ref, template_store_ref, and volume_store_ref). Each entry in these tables has a column called "updated"; every time the object's state changes, the "updated" column is updated as well. When does the object's state change? Every time the object is used in some context (such as copying a snapshot on NFS cache storage to somewhere else), its state changes accordingly, e.g. to "Copying", meaning the object is being copied to some place. This is exactly the information we need to implement an LRU algorithm. A rough sketch of such a reaper thread is shown below.
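>>
>> As a rough sketch of that background thread (the DAO and its methods are invented for illustration; only the table and column names come from the proposal above), something like the following could run on the management server:
>>
>>     import java.util.Date;
>>     import java.util.List;
>>     import java.util.concurrent.Executors;
>>     import java.util.concurrent.ScheduledExecutorService;
>>     import java.util.concurrent.TimeUnit;
>>
>>     public class CacheStorageReaper {
>>
>>         // Hypothetical view over snapshot_store_ref / template_store_ref / volume_store_ref.
>>         interface CacheObjectDao {
>>             // Entries whose "updated" timestamp is older than the cutoff and whose state is
>>             // not busy (e.g. not "Copying"), ordered least recently used first.
>>             List<Long> listIdleCacheObjects(Date updatedBefore);
>>             void evict(long cacheObjectId); // delete the file on the cache storage and the DB row
>>         }
>>
>>         private final CacheObjectDao dao;
>>         private final long maxIdleMillis;
>>         private final ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
>>
>>         public CacheStorageReaper(CacheObjectDao dao, long maxIdleMillis) {
>>             this.dao = dao;
>>             this.maxIdleMillis = maxIdleMillis;
>>         }
>>
>>         public void start(long intervalSeconds) {
>>             executor.scheduleWithFixedDelay(this::reap, intervalSeconds, intervalSeconds, TimeUnit.SECONDS);
>>         }
>>
>>         private void reap() {
>>             Date cutoff = new Date(System.currentTimeMillis() - maxIdleMillis);
>>             for (Long id : dao.listIdleCacheObjects(cutoff)) {
>>                 dao.evict(id);
>>             }
>>         }
>>     }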
>> 
>> What do you guys think about this fix? If you have a better solution, please let me know.
>> 
>> 
> 


Re: [DISCUSS] NFS cache storage issue on object_store

Posted by John Burwell <jb...@basho.com>.
Edison,

You have provided some great information below, which helps greatly in understanding the role of the "NFS cache" mechanism.  To summarize, this mechanism is currently required only for Xen snapshot operations driven by Xen's coalescing behavior.  Is my understanding correct?  Just out of curiosity, is there a Xen expert on the list who can provide a high-level description of the coalescing operation -- in particular, the way it interacts with storage?  I have Googled a bit and found very little information about it.  Has the object_store branch been tested with VMware and KVM?  If so, what operations on these hypervisors have been tested?

In reading through the description below, my operational concerns remain regarding potential race conditions and resource exhaustion.  I also think we should find a new name for this mechanism.  As Chip has previously mentioned, a cache implies the following characteristics:

    1. Optionality: Systems can operate without caches, just more slowly.  However, without this mechanism, snapshots on Xen will not function.
    2. Volatility: Caches are backed by durable, non-volatile storage.  Therefore, if the cache's data is lost, it can be rebuilt from the backing store and no data will be permanently lost from the system.  However, this mechanism contains snapshots in transit to an object store.  If the data contained in this "cache" were lost before its transfer to the object store completed, the snapshot data would be lost.

In order to set expectations with users and better frame our design conversation, I think it would be appropriate to describe this mechanism as a staging, scratch, or temporary area.  I also recommend removing the notion of NFS from its name, as NFS is only the initial implementation of this mechanism.  In the future, I can see a desire for local filesystem, RBD, and iSCSI implementations of it.

In terms of solving the potential race condition and resource exhaustion issues, I don't think an LRU approach will be sufficient, because the least recently used resource may still be in use by the system.  I think we should look to a reservation model with reference counting, where files are deleted once no processes are accessing them.  The following is a (handwave-handwave) overview of the process I think would meet these requirements:

	1. Request a reservation for the maximum size of the file(s) that will be processed in the staging area.
		- If the file is already in the staging area, increase its reference count
		- If the reservation cannot be fulfilled, we can either place the request in a retry queue or reject it.  
	2. Perform work and transfer file(s) to/from the object store
	3. Release the file(s) -- decrementing the reference count.  When the reference count is <= 0, delete the file(s) from the staging area

We would also likely want to consider a TTL to purge files after a configurable period of inactivity, as a backstop against crashed processes failing to properly decrement the reference count.  In this model, we either defer or reject work when resources are not available, and we properly bound resource usage.  
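
For illustration, here is a rough sketch of the reservation model described above: reserve space up front, reference-count files already staged, and let a TTL sweep reclaim entries whose owners crashed without releasing them. All names are hypothetical, and persistence and actual file deletion are glossed over; this is not production code.

    import java.util.HashMap;
    import java.util.Iterator;
    import java.util.Map;

    public class StagingAreaReservations {

        private static class Entry {
            int refCount;
            long sizeBytes;
            long lastTouchedMillis;
        }

        private final long capacityBytes;
        private final long ttlMillis;
        private long reservedBytes;
        private final Map<String, Entry> entries = new HashMap<String, Entry>();

        public StagingAreaReservations(long capacityBytes, long ttlMillis) {
            this.capacityBytes = capacityBytes;
            this.ttlMillis = ttlMillis;
        }

        // Step 1: reserve space for a file; returns false if the request should be queued or rejected.
        public synchronized boolean reserve(String path, long maxSizeBytes) {
            Entry e = entries.get(path);
            if (e != null) {                   // already staged: just bump the reference count
                e.refCount++;
                e.lastTouchedMillis = System.currentTimeMillis();
                return true;
            }
            if (reservedBytes + maxSizeBytes > capacityBytes) {
                return false;                  // caller retries later or rejects the operation
            }
            e = new Entry();
            e.refCount = 1;
            e.sizeBytes = maxSizeBytes;
            e.lastTouchedMillis = System.currentTimeMillis();
            entries.put(path, e);
            reservedBytes += maxSizeBytes;
            return true;
        }

        // Step 3: release a file; delete it from the staging area once nobody references it.
        public synchronized void release(String path) {
            Entry e = entries.get(path);
            if (e == null) {
                return;
            }
            if (--e.refCount <= 0) {
                entries.remove(path);
                reservedBytes -= e.sizeBytes;
                // ... delete the staged file here ...
            }
        }

        // TTL backstop, run from a timer: purge entries untouched for longer than the TTL.
        public synchronized void purgeStale() {
            long now = System.currentTimeMillis();
            for (Iterator<Map.Entry<String, Entry>> it = entries.entrySet().iterator(); it.hasNext(); ) {
                Entry e = it.next().getValue();
                if (now - e.lastTouchedMillis > ttlMillis) {
                    reservedBytes -= e.sizeBytes;
                    it.remove();               // ... delete the staged file here as well ...
                }
            }
        }
    }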

Finally, in terms of decoupling hypervisor plugins' decision to use this mechanism from the storage subsystem, I think we should expose methods on the secondary storage services that allow clients to explicitly request or create resources using files (i.e. java.io.File) instead of streams (e.g. createXXX(File) or readXXXAsFile).  These interfaces would provide the storage subsystem with the hint that the client requires file access to the requested resource.  For object store plugins, this hint would be used to wrap the resource in an object that transfers in and out of the staging area.
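
For what it's worth, a minimal sketch of what such file-oriented accessors might look like follows; the interface name, methods, and signatures are purely illustrative and are not the existing API.

    import java.io.File;
    import java.io.InputStream;

    public interface SecondaryStorageService {

        // Stream-based access: an object store plugin can satisfy these directly,
        // with no staging area involved.
        InputStream readTemplate(String templateUuid);
        void createTemplate(String templateUuid, InputStream content, long length);

        // File-based access: a hint that the caller (e.g. a Xen coalesce operation) needs a
        // real file on a mountable filesystem. An object store plugin would transfer the
        // object in and out of the staging area behind these calls.
        File readTemplateAsFile(String templateUuid);
        void createTemplate(String templateUuid, File content);
    }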

Thoughts?
-John
