Posted to dev@cloudstack.apache.org by Min Chen <mi...@citrix.com> on 2013/06/03 20:07:20 UTC

[DISCUSS] OBJECT_STORE branch design: Error handling in case of S3 as native secondary storage

Hi there,
This thread is to address John's comments about missing error handling for S3 as secondary storage in the object_store branch implementation. From the previous merge email thread, I realize we may not have explained clearly in the FS how S3 should work in the new object_store branch, which has caused several confusions. Let's make it clear here.

1. The goal of the object_store branch is to make S3 serve as NATIVE secondary storage, not just a backup device behind NFS secondary storage as in the master branch. If users choose S3 as their CloudStack secondary storage, they should be able to trust that their data (templates, snapshots, volumes) is actually stored in the S3 object store. When users register a template to S3, we directly issue S3 API calls to download the template into the S3 object store, instead of downloading it to NFS secondary storage and then syncing it to S3 on a schedule, as the master branch does. When we tell users that their data is READY on their S3 secondary storage, it really means that it is ready to use from S3. Master cannot make that guarantee: with S3 as a backup device, a snapshot may only be ready on NFS secondary storage and not on S3 because of a network connection issue, yet we mislead users into believing that their snapshot is ready on S3.
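
To make the contrast concrete, here is a minimal sketch, using the AWS SDK for Java, of what "register a template directly into S3" amounts to. This is not the actual object_store code; the class, method, bucket and key names are hypothetical, and error handling is omitted:

    import java.io.InputStream;
    import java.net.URL;
    import java.net.URLConnection;

    import com.amazonaws.auth.BasicAWSCredentials;
    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.AmazonS3Client;
    import com.amazonaws.services.s3.model.ObjectMetadata;

    // Hypothetical sketch only: stream a registered template straight into the
    // S3 object store, with no NFS staging copy and no background sync step.
    public class DirectS3TemplateRegister {

        public static void registerTemplate(String endpoint, String accessKey, String secretKey,
                String bucket, String key, String templateUrl) throws Exception {
            AmazonS3 s3 = new AmazonS3Client(new BasicAWSCredentials(accessKey, secretKey));
            s3.setEndpoint(endpoint);

            URLConnection source = new URL(templateUrl).openConnection();
            ObjectMetadata meta = new ObjectMetadata();
            // Streaming puts need a content length; a real implementation would
            // fall back to multipart upload when the length is unknown.
            meta.setContentLength(source.getContentLengthLong());
            try (InputStream in = source.getInputStream()) {
                s3.putObject(bucket, key, in, meta);
            }
            // Only after putObject returns successfully would the template be
            // marked READY, so READY really means "usable from S3".
        }
    }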

2. The NFS cache only comes into the picture when users choose S3 as their native secondary storage. The data stored in the NFS cache is genuinely temporary and serves as an intermediate transfer stage for CloudStack to manipulate data stored in S3; our design has no requirement that this intermediate data persist in the NFS cache forever for CloudStack to stay functional. This is quite different from the role of NFS secondary storage for S3 in the master branch, where we have to keep data on NFS secondary storage because we cannot guarantee that the data is READY on S3, due to the background sync issue I will get to in a minute. Theoretically, we should be able to implement a simple LRU or FIFO cache-aging algorithm (assuming the 4.2 feature freeze extension vote is approved) to age out old cache data without impacting any CloudStack functionality that uses S3. I am not sure the same holds for the NFS secondary storage data for S3 in the master branch; based on my reading of the code it does not, but maybe I am just too new to this part of master.
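
As a rough illustration of the aging policy mentioned above, an LRU sketch over cache entries could look like the following. Again, this is hypothetical and not code from the branch; entries can be aged out safely because the authoritative copy already lives in S3:

    import java.util.Iterator;
    import java.util.LinkedHashMap;
    import java.util.Map;

    // Hypothetical sketch of an LRU aging policy for the NFS cache.
    public class NfsCacheLru {
        private final long capacityBytes;
        private long usedBytes = 0;
        // accessOrder=true makes iteration run from least to most recently used.
        private final LinkedHashMap<String, Long> entries =
                new LinkedHashMap<String, Long>(16, 0.75f, true);

        public NfsCacheLru(long capacityBytes) {
            this.capacityBytes = capacityBytes;
        }

        // Record that a cached object was created or touched by a data operation.
        public synchronized void touch(String cachePath, long sizeBytes) {
            Long previous = entries.put(cachePath, sizeBytes);
            usedBytes += sizeBytes - (previous == null ? 0 : previous);
            evictIfNeeded();
        }

        private void evictIfNeeded() {
            Iterator<Map.Entry<String, Long>> it = entries.entrySet().iterator();
            while (usedBytes > capacityBytes && it.hasNext()) {
                Map.Entry<String, Long> eldest = it.next();
                usedBytes -= eldest.getValue();
                it.remove();
                // A real implementation would also delete the file on the NFS share
                // and skip entries still in use by an in-flight copy operation.
            }
        }
    }

A FIFO variant would simply use insertion order instead of access order; either way the cache stays bounded without affecting the data that CloudStack considers READY in S3.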

3. We have to admit that in the current object_store implementation, we only try each S3 operation (put, get, etc.) once; if it fails, we just report an error and the user has to retry manually. On this point we can definitely do better by adding a retry mechanism driven by a globally configured retry parameter; however, in my experience, retrying interactions with external devices indefinitely is always a bad idea. We also disagree with John's comment that dropping the previous background sync process is "a step back from the current Swift and S3 implementations present in 4.1.0". We agree that the master background sync process relieves the admin from manual retries for some S3 errors (although some errors, for example capacity full, will never recover even with a background process), but it also has a severe drawback: it gives users the misconception that their data is READY in S3 when it actually is not. Here is a simple example. Users take a snapshot in one zone and back it up to S3. Since S3 is region-wide, it is natural for them to expect that they can immediately restore that snapshot in another zone. With the current master implementation, this may fail: because of an S3 network connection issue at backup time, the snapshot may only be stored on zone-wide NFS secondary storage and not be ready on S3, and the next background sync has not kicked in yet. If users now try to restore, the operation is doomed to fail because the snapshot cannot be found. In our opinion, enhancing the current object_store implementation with some configured retry logic is a good compromise.
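
For the configured-retry enhancement we have in mind, a minimal sketch might look like the following; the retry count would come from a global configuration setting, and the class and parameter names here are made up for illustration:

    import java.util.concurrent.Callable;

    // Hypothetical sketch: retry an S3 operation a bounded number of times,
    // as set by a global configuration value, then surface the error to the
    // caller instead of silently deferring to a background sync.
    public class BoundedS3Retry {

        // maxRetries is assumed to be >= 0; backoffMillis is the base delay.
        public static <T> T execute(Callable<T> s3Operation, int maxRetries, long backoffMillis)
                throws Exception {
            Exception last = null;
            for (int attempt = 0; attempt <= maxRetries; attempt++) {
                try {
                    return s3Operation.call();
                } catch (Exception e) {
                    last = e;  // e.g. a transient network failure talking to the object store
                    if (attempt < maxRetries) {
                        Thread.sleep(backoffMillis * (attempt + 1));  // simple linear backoff
                    }
                }
            }
            // Give up: report the failure so the user knows the object is NOT ready in S3.
            throw last;
        }
    }

A caller would wrap each put/get in execute() with the retry count read from global configuration; a real implementation would also recognize non-recoverable errors such as capacity full and fail those immediately instead of retrying.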

Thanks.
-min