Posted to dev@cloudstack.apache.org by John Burwell <jb...@basho.com> on 2013/06/03 16:18:36 UTC

Re: [MERGE]object_store branch into master

Edison/Chip,

Please see my comments in-line.

Thanks,
-John

On May 31, 2013, at 4:04 PM, Chip Childers <ch...@sungard.com> wrote:

> Comments inline:
> 
> On Thu, May 30, 2013 at 09:42:29PM +0000, Edison Su wrote:
>> 
>> 
>>> -----Original Message-----
>>> From: John Burwell [mailto:jburwell@basho.com]
>>> Sent: Thursday, May 30, 2013 7:43 AM
>>> To: dev@cloudstack.apache.org
>>> Subject: Re: [MERGE]object_store branch into master
>>> 
>>> It feels like we have jumped to a solution without completely understanding
>>> the scope of the problem and the associated assumptions.  We have a
>>> community of hypervisor experts who we should consult to ensure we have
>>> the best solution.  As such, I recommend mailing the list with the specific
>>> hypervisors and functions that you have been unable to interface to storage
>>> that does not present a filesystem.  I do not recall seeing such a discussion on
>>> the list previously.
>> 
>> If people use zone-wide primary storage, such as Ceph or SolidFire, then suddenly there is no need for NFS cache storage, as zone-wide storage can be treated as both primary and secondary storage, with S3 as the backup storage. It's a simple but powerful solution.
>> Why can't we just add code to support these exciting new solutions? It's hard to do on the master branch; that's why Min and I worked hard to refactor the code and remove the NFS secondary storage dependency from the management server as much as possible. As we all know, NFS secondary storage is not scalable, no matter how fancy an aging policy or how advanced a capacity planner you have.
>> 
>> And that's one of the reasons I don't care that much about the issue with NFS cache storage; couldn't we put our energy into cloud-style storage solutions instead of into un-scalable storage?
> 
> Per your comment about you and Min working hard on this: nobody is
> saying that you didn't.  This isn't personal (or shouldn't be).  These
> are questions that are part of a consensus-based approach to
> development.
> 
>>> As I understand the goals of this enhancement, we will support additional
>>> secondary storage types and remove the assumption that secondary
>>> storage will always be NFS or have a filesystem.  As such, when a non-NFS
>>> type of secondary storage is employed, NFS is no longer the repository of
>>> record for this data.  We can always exceed available space in the repository
>>> of record, and the failure scenarios are relatively well understood (4.1.0) --
>>> operations will fail quickly and obviously.  However, as a transitory staging
>>> storage mechanism (4.2.0), the expectation of the user is the NFS storage will
>>> not be as reliable or large.  If the only solution we can provide for this
>>> problem is to recommend an NFS "cache" that is equal to the size of the
>>> object store itself, then we have made little to no progress addressing our users'
>> 
>> No, that's not true.  Admins can add multiple NFS cache storages if they want; there is no requirement that the NFS storage be the same size as the object store. I can't be that stupid.
>> It's the same thing we are doing on the master branch: admins know that one NFS secondary storage is not enough, so they can add multiple NFS secondary storages. And on the master branch,
>> there is no capacity planner for NFS secondary storage; the code just randomly chooses one of the NFS secondary storages, even if one of them is full. Yes, NFS secondary storage on master can fill up, and there is no way to age anything out.
>> 
>> The current object_store branch has the same behavior: admins can add multiple NFS cache storages, and there is no capacity planner. But if an NFS cache storage fills up, the admin can simply remove the DB entries for the cached objects and clean up the NFS cache storage, and then everything just works.
>> 
>> From an implementation point of view, I don't think there is any difference.
> 
> It's an expectation issue.  Operators expect to be able to manage their
> storage capacity.  So the question is, for the NFS "Cache", how do they
> plan size requirements and manage that capacity?

The driver for employing an object store is to reduce the cost per GB of storage while maintaining reliability and availability.  Requiring NFS reduces, if not eliminates, this benefit because the system architecture must ensure that the NFS "cache" (staging area) has sufficient capacity and reliability to hold data until it can be transferred to object storage.  How does adding multiple staging areas decrease complexity and cost?  As implemented, the NFS "cache" is unbounded, meaning that an operator would need an NFS "cache" as large as the object storage to avoid data loss and/or operational failures.
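
To make the capacity-planning gap concrete, here is a minimal sketch of what a capacity-aware choice of staging store could look like, instead of picking one at random even when it is full.  Every name below (StagingStore, StagingStoreAllocator, and so on) is hypothetical and for illustration only; this is not the actual object_store code:

    import java.util.Comparator;
    import java.util.List;
    import java.util.Optional;

    // Hypothetical model of an NFS staging ("cache") store; not the CloudStack classes.
    final class StagingStore {
        final String url;
        final long capacityBytes;
        final long usedBytes;

        StagingStore(String url, long capacityBytes, long usedBytes) {
            this.url = url;
            this.capacityBytes = capacityBytes;
            this.usedBytes = usedBytes;
        }

        long freeBytes() {
            return capacityBytes - usedBytes;
        }
    }

    final class StagingStoreAllocator {
        // Choose the staging store with the most free space that can hold the object,
        // rather than selecting one at random even when it is already full.
        Optional<StagingStore> select(List<StagingStore> stores, long requiredBytes) {
            return stores.stream()
                    .filter(s -> s.freeBytes() >= requiredBytes)
                    .max(Comparator.comparingLong(StagingStore::freeBytes));
        }
    }

An empty result would let the management server fail fast with "no staging capacity" before any bytes are copied, rather than part-way through a transfer.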

> 
>> 
>> 
>>> needs.  Fundamentally, the role of NFS is different in 4.2.0 than in 4.1.0.
>>> Therefore, I disagree with the assertion that this issue is present in 4.1.0.
>> 
>> The role of NFS may have changed, but both share the same problems: no capacity planner, no aging-out policy. 
>> 
> 
> Secondary storage capacity management is much easier to grok for
> operators.  I would bet that almost 100% of the time, their usage grows
> on a particular slope, allowing them to plan and allocate more when
> needed.
> 
> For the NFS "cache", the lifecycle of objects stored in that location,
> especially the cleanup routines, is going to be critical to the healthy
> operation of that environment.

+1. 
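
To illustrate the kind of lifecycle management Chip is describing, here is a rough sketch of a periodic cleanup pass, under the assumption that a staged copy is only needed until its transfer to the object store has completed.  StagedObject, StagingCatalog, and StagingCleanupTask are names I am inventing for illustration, not existing classes:

    import java.time.Duration;
    import java.time.Instant;
    import java.util.List;

    // Hypothetical record of an object staged on NFS; not the actual object_store schema.
    final class StagedObject {
        final String path;
        final Instant lastAccess;
        final boolean uploadedToObjectStore;

        StagedObject(String path, Instant lastAccess, boolean uploadedToObjectStore) {
            this.path = path;
            this.lastAccess = lastAccess;
            this.uploadedToObjectStore = uploadedToObjectStore;
        }
    }

    interface StagingCatalog {
        List<StagedObject> listStagedObjects();
        void delete(StagedObject obj);   // removes the NFS file and its DB entry together
    }

    final class StagingCleanupTask implements Runnable {
        private final StagingCatalog catalog;
        private final Duration idleThreshold;

        StagingCleanupTask(StagingCatalog catalog, Duration idleThreshold) {
            this.catalog = catalog;
            this.idleThreshold = idleThreshold;
        }

        @Override
        public void run() {
            Instant cutoff = Instant.now().minus(idleThreshold);
            for (StagedObject obj : catalog.listStagedObjects()) {
                // Only evict copies that are safely in the object store and have gone cold.
                if (obj.uploadedToObjectStore && obj.lastAccess.isBefore(cutoff)) {
                    catalog.delete(obj);
                }
            }
        }
    }

Something along these lines, scheduled on an interval, is what would keep a staging area from filling indefinitely.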

> 
>>> 
>>> An additional risk in the object_store implementation is that we lead a user
>>> to believe their data has been stored in reliable storage (e.g. S3, Riak CS, etc)
>>> when it may not be.  I saw no provision in the object_store branch to retry transfers if
>> 
>> I don't know from which code you drew this conclusion. Could you point it out in the code?
>> AFAIK, an object can only be either stored in S3 or not stored in S3; I don't know how the object could be in a wrong state.
>> 
>>> the object_store transfer fails or becomes unavailable.  In 4.0.0/4.1.0, if we
>>> can't connect to S3 or Swift, a background process continuously retries the
>>> upload until successful.
>> 
>> Here is the interesting situation: how does the mgt server or admin know that the background process pushed the objects into S3 successfully? There is no guarantee that the background process will succeed, and there is no status tracking for this background process, right?
>> 
>> What I am doing on the object_store branch is this: if pushing an object into S3 fails, then the whole backup process fails, and the admin or user needs to send another API request to push the object into S3. This guarantees that the operation either succeeds or fails, instead of ending up in the unknown state we have on the master branch. 
>> 
> 
> That's the right approach IMO (at least it's correct, per the current
> model of operations either working or not).

As I previously stated, this functionality is a step back from the current Swift and S3 implementations present in 4.1.0.  I also think it is an unreasonable burden to place on an operator to check that every possible transfer succeeded and then issue a retry of the copy.
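
For reference, the 4.1.0-style behavior I am describing amounts to roughly the following: queue the transfer and retry it in the background with backoff until it succeeds, rather than failing the whole operation and waiting for an operator to re-issue it.  This is a simplified sketch with an invented ObjectStoreUploader interface standing in for the real Swift/S3 client; it is not the actual 4.1.0 code:

    import java.io.File;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    // Hypothetical uploader abstraction; stands in for the real Swift/S3 client.
    interface ObjectStoreUploader {
        void upload(String bucket, String key, File file) throws Exception;
    }

    final class BackgroundUploadRetrier {
        private final ObjectStoreUploader uploader;
        private final ScheduledExecutorService scheduler =
                Executors.newSingleThreadScheduledExecutor();

        BackgroundUploadRetrier(ObjectStoreUploader uploader) {
            this.uploader = uploader;
        }

        // Keep retrying the upload with exponential backoff until it succeeds.
        void uploadUntilSuccessful(String bucket, String key, File file) {
            attempt(bucket, key, file, 1);
        }

        private void attempt(String bucket, String key, File file, int attemptNo) {
            try {
                uploader.upload(bucket, key, file);
            } catch (Exception e) {
                long delaySeconds = Math.min(3600L, (long) Math.pow(2, attemptNo));
                scheduler.schedule(() -> attempt(bucket, key, file, attemptNo + 1),
                        delaySeconds, TimeUnit.SECONDS);
            }
        }
    }

The important property is that the operator never has to notice and manually retry an individual failed upload.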

I am also curious about the phrase "backup".  My understanding of this branch's goals was to support object stores as native secondary storage.  4.1.0 already supports backing up secondary storage to Swift and S3.  Is your vision for object_store that object stores can be used as native secondary storage?

> 
>>> 
>>> Finally, I see this issue as more of a design issue than a bug.  I don't think we should
>> 
>> Again, I don't think it's a design issue; as I said above, it's a bug, and both the master branch and object_store have the same bug. It can be fixed, and it is easier to fix on object_store than on the master branch. And it's not an important issue compared to supporting a cloud-style storage solution.
>> 
> 
> Can we discuss fixing it in the object_store branch then?

Could you please define what you mean by a cloud style storage solution?  

> 
>>> Given the different use of NFS in the object_store branch vs. current, I don't
>>> see the comparison in this case.  In the current implementation, when we
>>> exhaust space, we are truly out of resources.  However, in the object_store
>>> branch, we have no provision to remove stale data and we may report no
>>> space available when there is plenty of space available in the underlying
>>> object store.  In this scenario, the NFS "cache" becomes an artificial limiter on
>>> the capacity of the system.  I do not understand how we have this problem in
>>> current since the object store is only a backup of secondary store -- not
>>> secondary storage itself.
>> 
>> As I said before, no matter what the role of the NFS storage is, it shares the same issues: the NFS storage can run out of capacity, there is no capacity planner, and there is no aging policy. 
>> 
> 
> But as I note above, the operator's planning process will be quite
> difficult.

Also, as I previously noted, the exhaustion has a completely different cause.  In 4.1, I am truly out of secondary storage.  As Chip mentioned, it is straightforward to plan for space requirements.  In object_store, I likely have not exhausted secondary storage space, but have filled the cache.  Since most operators will want as little NFS space as necessary in this scenario, my educated guess is that we will see exhaustion of the cache far more frequently.
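
To put a rough number on that, here is a back-of-the-envelope sizing with purely illustrative figures of my own: the staging area has to hold every in-flight copy at once, so its minimum size is roughly the number of concurrent transfers times the largest object being moved, plus headroom.

    // Illustrative sizing estimate for an NFS staging area; all figures are made up.
    public final class StagingSizingEstimate {
        public static void main(String[] args) {
            long largestObjectBytes = 50L * 1024 * 1024 * 1024;  // e.g. a 50 GB template
            int concurrentTransfers = 20;                        // e.g. 20 simultaneous copies
            long headroomPercent = 20;                           // safety margin

            long minimumBytes = largestObjectBytes * concurrentTransfers;
            long recommendedBytes = minimumBytes + (minimumBytes * headroomPercent / 100);

            System.out.printf("Minimum staging capacity: %d GB%n", minimumBytes >> 30);
            System.out.printf("Recommended with headroom: %d GB%n", recommendedBytes >> 30);
        }
    }

If the operator sizes the staging area for fewer concurrent transfers than actually occur, the cache fills and operations fail even though the object store itself has plenty of room.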

> 
>>> It is my estimate that robust error handling will require design changes (e.g.
>>> introduction of a resource reservation mechanism, introduction of additional
>>> exception classes, enhancement of interfaces to provide more context
>>> regarding client intentions, etc) yielding significant code impact.  These
>>> changes need to be undertaken in a holistic manner with minimum risk to
>>> master.   Fundamentally, we should not be merging code to master with
>>> known significant issues.  When it goes to master, we should be saying, "To
>>> the best of my knowledge and developer testing, there are no blocker or
>>> critical issues."  In my opinion, omission of robust error handling does not
>>> meet that standard.
>> 
>> To be realistic, on the mgt server there is only one class that depends on cache storage, and only one interface that needs to be implemented to solve the issue, so why do we need a redesign?
> 
> Right, let's look at how to deal with it cleanly within that
> implementation (although I suspect that the changes will leak out of
> that class).
> 

The lack of error handling extends beyond the cache.  The entire branch needs to be evaluated for exception handling.
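
As one concrete example of the design changes I sketched above, a resource reservation mechanism for staging capacity could follow a reserve-then-commit pattern, letting a copy fail fast, before any bytes move, when there is no room.  This is purely a sketch with hypothetical names, not a proposal for the exact API:

    import java.util.concurrent.atomic.AtomicLong;

    // Hypothetical capacity reservation for a staging store; not existing CloudStack code.
    final class StagingCapacityReservations {
        private final long capacityBytes;
        private final AtomicLong reservedBytes = new AtomicLong();

        StagingCapacityReservations(long capacityBytes) {
            this.capacityBytes = capacityBytes;
        }

        // Try to reserve space up front; the caller fails fast if this returns false.
        boolean reserve(long bytes) {
            while (true) {
                long current = reservedBytes.get();
                if (current + bytes > capacityBytes) {
                    return false;
                }
                if (reservedBytes.compareAndSet(current, current + bytes)) {
                    return true;
                }
            }
        }

        // Release the reservation when the copy completes or is aborted.
        void release(long bytes) {
            reservedBytes.addAndGet(-bytes);
        }
    }

A copy operation would call reserve() before touching the NFS export and release() in a finally block, so a failed transfer never leaves capacity permanently accounted for.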


Re: [MERGE]object_store branch into master

Posted by Min Chen <mi...@citrix.com>.
Sure. Edison will start one soon with this context information.

Thanks
-min

On 6/3/13 10:33 AM, "John Burwell" <jb...@basho.com> wrote:

>Chip/Min,
>
>For thread 1, I would like to see an expanded discussion regarding the
>need for the staging area.  In particular, what features on which
>hypervisors created the need for it.  With the wider expertise of the
>list, we may be able to find solutions to these issues that either reduce
>or eliminate the need for the cache.
>
>Thanks,
>-John
>
>On Jun 3, 2013, at 1:11 PM, Chip Childers <ch...@sungard.com>
>wrote:
>
>> On Mon, Jun 03, 2013 at 05:09:24PM +0000, Min Chen wrote:
>>> Chip/John,
>>> 
>>> 	This thread has become very hard to follow because several technical
>>> debates are mixed together. Chip earlier made a good suggestion that we
>>> should start separate threads for the important architectural issues
>>> raised by John, so that the community can get a clear grasp of the
>>> issues under debate and reach a wise conclusion. If there is no
>>> objection, we are going to do that right now. If we have followed this
>>> thread correctly, it boils down to the following 3 major technical
>>> issues:
>>> 	1. Missing capacity planning in the NFS cache storage implementation.
>>> 	2. Error handling in the case of S3 as native secondary storage.
>>> 	3. S3TemplateDownloader implementation issue.
>>> If we didn't miss anything, we will start these 3 DISCUSS threads
>>> shortly.
>>> 
>>> 	Thanks
>>> 	-min
>> 
>> +1 - do it!
>


Re: [MERGE]object_store branch into master

Posted by John Burwell <jb...@basho.com>.
Chip/Min,

For thread 1, I would like to see an expanded discussion regarding the need for the staging area.  In particular, what features on which hypervisors created the need for it.  With the wider expertise of the list, we may be able to find solutions to these issues that either reduce or eliminate the need for the cache.

Thanks,
-John

On Jun 3, 2013, at 1:11 PM, Chip Childers <ch...@sungard.com> wrote:

> On Mon, Jun 03, 2013 at 05:09:24PM +0000, Min Chen wrote:
>> Chip/John,
>> 
>> 	This thread has become very hard to follow because several technical
>> debates are mixed together. Chip earlier made a good suggestion that we should
>> start separate threads for the important architectural issues raised
>> by John, so that the community can get a clear grasp of the issues under
>> debate and reach a wise conclusion. If there is no objection, we are going to
>> do that right now. If we have followed this thread correctly, it boils
>> down to the following 3 major technical issues:
>> 	1. Missing capacity planning in the NFS cache storage implementation.
>> 	2. Error handling in the case of S3 as native secondary storage.
>> 	3. S3TemplateDownloader implementation issue.
>> If we didn't miss anything, we will start these 3 DISCUSS threads shortly.
>> 
>> 	Thanks
>> 	-min
> 
> +1 - do it!


Re: [MERGE]object_store branch into master

Posted by Chip Childers <ch...@sungard.com>.
On Mon, Jun 03, 2013 at 05:09:24PM +0000, Min Chen wrote:
> Chip/John,
> 
> 	This thread has become very hard to follow because several technical
> debates are mixed together. Chip earlier made a good suggestion that we should
> start separate threads for the important architectural issues raised
> by John, so that the community can get a clear grasp of the issues under
> debate and reach a wise conclusion. If there is no objection, we are going to
> do that right now. If we have followed this thread correctly, it boils
> down to the following 3 major technical issues:
> 	1. Missing capacity planning in the NFS cache storage implementation.
> 	2. Error handling in the case of S3 as native secondary storage.
> 	3. S3TemplateDownloader implementation issue.
> If we didn't miss anything, we will start these 3 DISCUSS threads shortly.
> 
> 	Thanks
> 	-min

+1 - do it!

Re: [MERGE]object_store branch into master

Posted by Min Chen <mi...@citrix.com>.
Chip/John,

	This thread has become very hard to follow because several technical
debates are mixed together. Chip earlier made a good suggestion that we should
start separate threads for the important architectural issues raised
by John, so that the community can get a clear grasp of the issues under
debate and reach a wise conclusion. If there is no objection, we are going to
do that right now. If we have followed this thread correctly, it boils
down to the following 3 major technical issues:
	1. Missing capacity planning in the NFS cache storage implementation.
	2. Error handling in the case of S3 as native secondary storage.
	3. S3TemplateDownloader implementation issue.
If we didn't miss anything, we will start these 3 DISCUSS threads shortly.

	Thanks
	-min

On 6/3/13 7:18 AM, "John Burwell" <jb...@basho.com> wrote:

>Edison/Chip,
>
>Please see my comments in-line.
>
>Thanks,
>-John
>
>On May 31, 2013, at 4:04 PM, Chip Childers <ch...@sungard.com>
>wrote:
>
>> Comments inline:
>> 
>> On Thu, May 30, 2013 at 09:42:29PM +0000, Edison Su wrote:
>>> 
>>> 
>>>> -----Original Message-----
>>>> From: John Burwell [mailto:jburwell@basho.com]
>>>> Sent: Thursday, May 30, 2013 7:43 AM
>>>> To: dev@cloudstack.apache.org
>>>> Subject: Re: [MERGE]object_store branch into master
>>>> 
>>>> It feels like we have jumped to a solution without completely
>>>>understanding
>>>> the scope of the problem and the associated assumptions.  We have a
>>>> community of hypervisor experts who we should consult to ensure we
>>>>have
>>>> the best solution.  As such, I recommend mailing the list with the
>>>>specific
>>>> hypervisors and functions that you have been unable to interface to
>>>>storage
>>>> that does not present a filesystem.  I do not recall seeing such a
>>>>discussion on
>>>> the list previously.
>>> 
>>> If people using zone-wide primary storage, like, ceph/solidfire, then
>>>suddenly, there is no need for nfs cache storage, as zone-wide storage
>>>can be treated as both primary/secondary storage, S3 as the backup
>>>storage. It's a simple but powerful solution.
>>> Why we can't just add code to support this exciting new solutions?
>>>It's hard to do it on master branch, that's why Min and I worked hard
>>>to refactor the code, and remove nfs secondary storage dependency from
>>>management server as much as possible. All we know, nfs secondary
>>>storage is not scalable, not matter how fancy aging policy you have,
>>>how advanced capacity planner you have.
>>> 
>>> And that's one of reason I don't care that much about the issue with
>>>nfs cache storage, couldn't we put our energy on cloud style storage
>>>solution, instead of on the un-scalable storage?
>> 
>> Per your comment about you and Min working hard on this: nobody is
>> saying that you didn't.  This isn't personal (or shouldn't be).  These
>> are questions that are part of a consensus-based approach to
>> development.
>> 
>>>> As I understand the goals of this enhancement, we will support
>>>>additional
>>>> secondary storage types and removing the assumption that secondary
>>>> storage will always be NFS or have a filesystem.  As such, when a
>>>>non-NFS
>>>> type of secondary storage is employed, NFS is no longer the
>>>>repository of
>>>> record for this data.  We can always exceed available space in the
>>>>repository
>>>> of record, and the failure scenarios are relatively well understood
>>>>(4.1.0) --
>>>> operations will fail quickly and obviously.  However, as a transitory
>>>>staging
>>>> storage mechanism (4.2.0), the expectation of the user is the NFS
>>>>storage will
>>>> not be as reliable or large.  If the only solution we can provide for
>>>>this
>>>> problem is to recommend an NFS "cache" that is equal to the size of
>>>>the
>>>> object store itself then we have little to no progress addressing our
>>>>user's
>>> 
>>> No, it's not true.  Admin can add multiple NFS cache storages if they
>>>want, there is no such requirement that NFS storage will be the same
>>>size of object store, I can't be that stupid.
>>> It's the same thing that we are doing on the master branch: admin
>>>knows that one NFS secondary storage is not enough, so they can add
>>>multiple NFS secondary storage. And on the master branch,
>>> There is no capacity planner for NFS secondary storage, if the code
>>>just randomly chooses one of NFS secondary storages, even if one of
>>>them are full. Yes, NFS secondary storage on master can be full, there
>>>is no way to aging out.
>>> 
>>> On the current object_store branch, it has the same behavior, admin
>>>can add multiple NFS cache storages, no capacity planner. While, in
>>>case nfs cache storage is full, admin can just simply remove the db
>>>entry related to cached object, and cleanup NFS cache storage, then
>>>suddenly, everything just works.
>>> 
>>> From implementation point of view, I don't think there is any
>>>difference. 
>> 
>> It's an expectation issue.  Operators expect to be able to manage their
>> storage capacity.  So the question is, for the NFS "Cache", how do they
>> plan size requirements and manage that capacity?
>
>The driver for employing an object store is to reduce the cost per GB of
>storage while maintaining reliability and availability.  Requiring NFS
>reduces, if not eliminates, this benefit because system architectures
>must ensure that the NFS "cache" (staging area) has sufficient capacity
>and reliability to hold data until it can be transferred to object
>storage.  How does adding multiple staging areas decrease complexity and
>cost?  As implemented, the NFS "cache" is unbounded meaning that an
>operator would need to have a NFS "cache" as large as object storage to
>avoid data loss and/or operational failures.
>
>> 
>>> 
>>> 
>>>> needs.  Fundamentally, the role of the NFS is different in 4.2.0 than
>>>>4.1.0.
>>>> Therefore, I disagree with the assertion that issue is present in
>>>>4.1.0.
>>> 
>>> The role of NFS can be changed, but they share the same problem, no
>>>capacity planner, no aging out policy.
>>> 
>> 
>> Secondary storage capacity management is much easier to grok for
>> operators.  I would bet that almost 100% of the time, their usage grows
>> on a particular slope, allowing them to plan and allocate more when
>> needed.
>> 
>> For the NFS "cache", lifecycle of objects stored in that location,
>> especially cleanup routines, are going to be critical to the healthy
>> operation of that environment.
>
>+1. 
>
>> 
>>>> 
>>>> An additional risk in the object_store implementation is that we lead
>>>>a user
>>>> to believe their data has been stored in reliable storage (e.g. S3,
>>>>Riak CS, etc)
>>>> when it may not.  I saw no provision in the object_store to retry
>>>>transfers if
>>> 
>>> I don't know from which code you get this kind of conclusion. Could
>>>you help to point out in the code?
>>> AFAIK, the object can only be either stored in S3 or not stored in S3,
>>>I don't know how  the object can be in a wrong state.
>>> 
>>>> the object_store transfer fails or becomes unavailable.  In
>>>>4.0.0/4.1.0, if we
>>>> can't connect to S3 or Swift, a background process continuously
>>>>retries the
>>>> upload until successful.
>>> 
>>> Here is the interesting situation coming out: how the mgt server or
>>>admin knows that background process push the objects successfully into
>>>s3? There is no guarantee the background process will success, there is
>>>no status track for this background process, right?
>>> 
>>> What I am doing on the object_store branch is that, if push object
>>>into S3 failed, then the whole backup process failed, admin or user
>>>needs to send out another API request to push object into S3. This will
>>>guarantee that operation will either success or failed, instead of in a
>>>unknown state that we are doing on master branch.
>>> 
>> 
>> That's the right approach IMO (at least it's correct, per the current
>> model of operations either working or not).
>
>As I previously stated, this functionality is a step back from the
>current Swift and S3 implementations present in 4.1.0.  I also think it
>is an unreasonable burden to place on an operator to check that every
>possible transfer succeeded and then issue a retry of the copy.
>
>I am also curious about the phrase "backup".  My understanding of this
>branch's goals was to support object stores as native secondary storage.
>4.1.0 already supports backing up secondary storage to Swift and S3.  Is
>your vision for object_store that object stores can be used as native
>secondary storage?
>
>> 
>>>> 
>>>> Finally, I see this issue as a design issue than a bug.  I don't
>>>>think we should
>>> 
>>> Again, I don't think it's a design issue, as I said above, it's a bug,
>>>both master branch and object_store have the same bug. It can be fixed,
>>>and easy to be fixed on object_store comparing with fixing it on master
>>>branch. And it's not an important issue, comparing to support cloud
>>>style storage solution.
>>> 
>> 
>> Can we discuss fixing it in the object_store branch then?
>
>Could you please define what you mean by a cloud style storage solution?
>
>> 
>>>> Given the different use of NFS in the object_store branch vs.
>>>>current, I don't
>>>> see the comparison in this case.  In the current implementation, when
>>>>we
>>>> exhaust space, we are truly out of resource.  However, in the
>>>>object_store
>>>> branch, we have no provision to remove stale data and we may report no
>>>> space available when there is plenty of space available in the
>>>>underlying
>>>> object store.  In this scenario, the NFS "cache" becomes an
>>>>artificial limiter on
>>>> the capacity of the system.  I do not understand how we have this
>>>>problem in
>>>> current since the object store is only a backup of secondary store --
>>>>not
>>>> secondary storage itself.
>>> 
>>> As I said before, no matter what's the role of NFS storage, it shares
>>>the same issue, both NFS storage can be out of capacity, no capacity
>>>planner, no aging policy.
>>> 
>> 
>> But as I note above, the operator's planning process will be quite
>> difficult.
>
>Also, as I previously noted, the exhaustion is a completely different
>cause.  In 4.1, I am truly out of the secondary storage.  As Chip
>mentioned, it is straightforward to plan for space requirements.  In
>object_store, I likely am not exhausted of secondary storage space, but
>have filled the cache.  Since most operators will want as a little NFS
>space as necessary in this scenario, my educated guess is that we will
>see exhaustion of cache far more frequently.
>
>> 
>>>> It is my estimate robust error handling will require design changes
>>>>(e.g.
>>>> introduction of a resource reservation mechanism, introduction of
>>>>addition
>>>> exception classes, enhancement of interfaces to provide more context
>>>> regarding client intentions, etc) yielding significant code impact.
>>>>These
>>>> changes need to undertaken in a holistic manner with minimum risk to
>>>> master.   Fundamentally, we should not be merging code to master with
>>>> known significant issues.  When it goes to master, we should be
>>>>saying, "To
>>>> the best of my knowledge and developer testing, there are no blocker
>>>>or
>>>> critical issues."  In my opinion, omission of robust error handling
>>>>does not
>>>> meet that standard.
>>> 
>>> To be realistic, on the mgt server, there is only one class which is
>>>depended on cache storage, there is only one interface needs to be
>>>implemented to solve the issue, why we need redesign?
>> 
>> Right, let's look at how to deal with it cleanly within that
>> implementation (although I suspect that the changes will leak out of
>> that class).
>> 
>
>The lack of error handling extends beyond the cache.  The entire branch
>needs to be evaluated for exception handling.
>