Posted to dev@hbase.apache.org by Josh Elser <el...@apache.org> on 2017/09/12 16:07:02 UTC

[DISCUSS] Plan for Distributed testing of Backup and Restore

On 9/11/17 11:52 PM, Stack wrote:
> On Mon, Sep 11, 2017 at 11:07 AM, Vladimir Rodionov <vl...@gmail.com>
> wrote:
> 
>> ...
>> That is mostly it. Yes, we have not done real testing with real data on a
>> real cluster yet, except QA testing on a small OpenStack
>> cluster (10 nodes). That is probably our biggest minus right now. I
>> would like to inform the community that this week we are going to start
>> full scale testing with reasonably sized data sets.
>>
> ... Completion of HA seems important as is result of the scale testing.
> 

I think we should knock out a rough sketch on what effective "scale" 
testing would look like since that is a very subjective phrase. Let me 
start the ball rolling with a few things that come to my mind.

(interpreting requirements as per rfc2119)

* MUST have >5 RegionServers and >1 Masters in play
* MUST have Non-trivial final data sizes (final data size would be >= 
100's of GB)
* MUST have some clear pass/fail determination for correctness of B&R
* MUST have some fault-injection

* SHOULD be a completely automated test, not requiring a human to 
coordinate or execute commands.
* SHOULD be able to acquire operational insight (metrics) while 
performing operations to determine success of testing
* SHOULD NOT require manual intervention, e.g. working around known 
issues/limitations
* SHOULD reuse the IntegrationTest framework in hbase-it

Since we have a concern of correctness, ITBLL sounds like a good 
starting point to avoid having to re-write similar kinds of logic. 
ChaosMonkey is always great for fault-injection.
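
To make that concrete, here is the rough shape of the harness I'm 
picturing. This is only a sketch: every hook below is a hypothetical 
placeholder that, in a real implementation, would delegate to the 
hbase-it chaos facilities, ITBLL-style load/verification, and the 
backup/restore client.

// Sketch only. All hooks are hypothetical placeholders; a real version would
// delegate to hbase-it (chaos), ITBLL-style generate/verify, and the backup client.
public abstract class BackupRestoreScaleTestSketch {

  protected abstract void startChaos() throws Exception;                   // fault-injection on
  protected abstract void stopChaos() throws Exception;
  protected abstract void runLoadIteration(int iteration) throws Exception; // ITBLL-style load
  protected abstract String takeBackup(boolean full) throws Exception;      // returns a backup id
  protected abstract boolean restoreAndVerify(String backupId) throws Exception;

  /** Runs fully unattended and returns a single pass/fail result. */
  public boolean run(int iterations, int verifyEvery) throws Exception {
    java.util.List<String> backupIds = new java.util.ArrayList<>();
    boolean allVerified = true;
    startChaos();
    try {
      for (int i = 0; i < iterations; i++) {
        runLoadIteration(i);
        backupIds.add(takeBackup(i == 0));       // full first, incrementals afterwards
        if (i > 0 && i % verifyEvery == 0) {
          allVerified &= restoreAndVerify(backupIds.get(backupIds.size() - 1));
        }
      }
    } finally {
      stopChaos();
    }
    return allVerified;                          // clear pass/fail, no human in the loop
  }
}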

Thoughts?

Re: [DISCUSS] Plan for Distributed testing of Backup and Restore

Posted by Josh Elser <el...@apache.org>.

On 9/12/17 2:51 PM, Andrew Purtell wrote:
>> making backup work in challenging conditions was not a goal of the FT
> design; correct failure handling was the goal.
> 
> Every real-world production environment has challenging conditions.
> 
> That said, making progress in the face of failures is only one aspect of
> FT, and an equally valid one is that failures do not cause data corruption.
> 
> If testing with chaos proves this backup solution will fail if there is any
> failure while backup is in progress, but at least it will successfully
> clean up and not corrupt existing state - that could be ok, for some.
> Possibly, us.

Agreed. There are always differences of opinion around acceptable levels 
of tolerance. Understanding how things fail (avoiding the need for 
manual intervention to correct them) is a good initial goalpost, as we can 
concisely document that for users. My impression is that this wouldn't 
require a significant amount of work to achieve an acceptable degree of 
stability.

> If testing with chaos proves this backup solution will not suffer
> corruption if there is a failure *and* can still successfully complete if
> there is any failure while backup is in progress - that would obviously
> improve the perceived value proposition.
> 
> It would be fine to test this using hbase-it chaos facilities but with a
> less aggressive policy than slowDeterministic that allows for backups to
> successfully complete once in a while yet also demonstrate that when the
> failures do happen things are properly cleaned up and data corruption does
> not happen.
> 
> 
> 
> 
> On Tue, Sep 12, 2017 at 11:25 AM, Vladimir Rodionov <vl...@gmail.com>
> wrote:
> 
>>>> Vlad: I'm obviously curious to see what you think about this stuff, in
>> addition to what you already had in mind :)
>>
>> Yes, I think that we need a test tool similar to ITBLL. Btw, making backup
>> work in challenging conditions was not a goal of the FT design; correct
>> failure handling was the goal.

Based on Ted's mention of ITBackupRestore (thanks btw, Ted!), I think 
that gets into the details a little too much for this thread. We'll 
definitely need to improve on that test for what we're discussing here, 
but perhaps it's a nice starting point?

>> On Tue, Sep 12, 2017 at 9:53 AM, Josh Elser <el...@apache.org> wrote:
>>
>>> Thanks for the quick feedback!
>>>
>>> On 9/12/17 12:36 PM, Stack wrote:
>>>
>>>> On Tue, Sep 12, 2017 at 9:33 AM, Andrew Purtell <
>> andrew.purtell@gmail.com
>>>>>
>>>> wrote:
>>>>
>>>> I think those are reasonable criteria Josh.
>>>>>
>>>>> What I would like to see is something like "we ran ITBLL (or custom
>>>>> generator with similar correctness validation if you prefer) on a dev
>>>>> cluster (5-10 nodes) for 24 hours with server killing chaos agents
>>>>> active,
>>>>> attempted 1,440 backups (one per minute), of which 1,000 succeeded and
>>>>> 100%
>>>>> of these were successfully restored and validated." This implies your
>>>>> points on automation and no manual intervention. Maybe the number of
>>>>> successful backups under challenging conditions will be lower. Point is
>>>>> they demonstrate we can rely on it even when a cluster is partially
>>>>> unhealthy, which in production is often the normal order of affairs.
>>>>>
>>>>>
>>>>>
>>> I like it. I hadn't thought about stressing quite this aggressively, but
>>> now that I think about it, sounds like a great plan. Having some ballpark
>>> measure to quantify the cost of a "backup-heavy" workload would be cool
>> in
>>> addition to seeing how the system reacts in unexpected manners.
>>>
>>> Sounds good to me.
>>>>
>>>> How will you test the restore aspect? After 1k (or whatever makes sense)
>>>> incremental backups over the life of the chaos, could you restore and
>>>> validate that the table had all expected data in place.
>>>>
>>>
>>> Exactly. My thinking was that, at any point, we should be able to do a
>>> restore and validate. Maybe something like: every Nth ITBLL iteration,
>> make
>>> a new backup point, restore a previous backup point, verify, restore to
>>> newest backup point. The previous backup point should be a full or
>>> incremental point.
>>>
>>> Vlad: I'm obviously curious to see what you think about this stuff, in
>>> addition to what you already had in mind :)
>>>
>>
> 
> 
> 

Re: [DISCUSS] Plan for Distributed testing of Backup and Restore

Posted by Vladimir Rodionov <vl...@gmail.com>.
Yes, we already have some IT, so we will need to upgrade it for scale testing.

On Tue, Sep 12, 2017 at 11:28 AM, Ted Yu <yu...@gmail.com> wrote:

> bq. we need a test tool similar to ITBLL
>
> How about making the following such a tool ?
>
> hbase-it/src/test/java/org/apache/hadoop/hbase/
> IntegrationTestBackupRestore.java
>
> On Tue, Sep 12, 2017 at 11:25 AM, Vladimir Rodionov <
> vladrodionov@gmail.com>
> wrote:
>
> > >> Vlad: I'm obviously curious to see what you think about this stuff, in
> > addition to what you already had in mind :)
> >
> > Yes, I think that we need a test tool similar to ITBLL. Btw, making backup
> > work in challenging conditions was not a goal of the FT design; correct
> > failure handling was the goal.
> >
> > On Tue, Sep 12, 2017 at 9:53 AM, Josh Elser <el...@apache.org> wrote:
> >
> > > Thanks for the quick feedback!
> > >
> > > On 9/12/17 12:36 PM, Stack wrote:
> > >
> > >> On Tue, Sep 12, 2017 at 9:33 AM, Andrew Purtell <
> > andrew.purtell@gmail.com
> > >> >
> > >> wrote:
> > >>
> > >> I think those are reasonable criteria Josh.
> > >>>
> > >>> What I would like to see is something like "we ran ITBLL (or custom
> > >>> generator with similar correctness validation if you prefer) on a dev
> > >>> cluster (5-10 nodes) for 24 hours with server killing chaos agents
> > >>> active,
> > >>> attempted 1,440 backups (one per minute), of which 1,000 succeeded
> and
> > >>> 100%
> > >>> of these were successfully restored and validated." This implies your
> > >>> points on automation and no manual intervention. Maybe the number of
> > >>> successful backups under challenging conditions will be lower. Point
> is
> > >>> they demonstrate we can rely on it even when a cluster is partially
> > >>> unhealthy, which in production is often the normal order of affairs.
> > >>>
> > >>>
> > >>>
> > > I like it. I hadn't thought about stressing quite this aggressively,
> but
> > > now that I think about it, sounds like a great plan. Having some
> ballpark
> > > measure to quantify the cost of a "backup-heavy" workload would be cool
> > in
> > > addition to seeing how the system reacts in unexpected manners.
> > >
> > > Sounds good to me.
> > >>
> > >> How will you test the restore aspect? After 1k (or whatever makes
> sense)
> > >> incremental backups over the life of the chaos, could you restore and
> > >> validate that the table had all expected data in place.
> > >>
> > >
> > > Exactly. My thinking was that, at any point, we should be able to do a
> > > restore and validate. Maybe something like: every Nth ITBLL iteration,
> > make
> > > a new backup point, restore a previous backup point, verify, restore to
> > > newest backup point. The previous backup point should be a full or
> > > incremental point.
> > >
> > > Vlad: I'm obviously curious to see what you think about this stuff, in
> > > addition to what you already had in mind :)
> > >
> >
>

Re: [DISCUSS] Plan for Distributed testing of Backup and Restore

Posted by Ted Yu <yu...@gmail.com>.
bq. we need a test tool similar to ITBLL

How about making the following such a tool ?

hbase-it/src/test/java/org/apache/hadoop/hbase/IntegrationTestBackupRestore.java

On Tue, Sep 12, 2017 at 11:25 AM, Vladimir Rodionov <vl...@gmail.com>
wrote:

> >> Vlad: I'm obviously curious to see what you think about this stuff, in
> addition to what you already had in mind :)
>
> Yes, I think that we need a test tool similar to ITBLL. Btw, making backup
> work in challenging conditions was not a goal of the FT design; correct
> failure handling was the goal.
>
> On Tue, Sep 12, 2017 at 9:53 AM, Josh Elser <el...@apache.org> wrote:
>
> > Thanks for the quick feedback!
> >
> > On 9/12/17 12:36 PM, Stack wrote:
> >
> >> On Tue, Sep 12, 2017 at 9:33 AM, Andrew Purtell <
> andrew.purtell@gmail.com
> >> >
> >> wrote:
> >>
> >> I think those are reasonable criteria Josh.
> >>>
> >>> What I would like to see is something like "we ran ITBLL (or custom
> >>> generator with similar correctness validation if you prefer) on a dev
> >>> cluster (5-10 nodes) for 24 hours with server killing chaos agents
> >>> active,
> >>> attempted 1,440 backups (one per minute), of which 1,000 succeeded and
> >>> 100%
> >>> of these were successfully restored and validated." This implies your
> >>> points on automation and no manual intervention. Maybe the number of
> >>> successful backups under challenging conditions will be lower. Point is
> >>> they demonstrate we can rely on it even when a cluster is partially
> >>> unhealthy, which in production is often the normal order of affairs.
> >>>
> >>>
> >>>
> > I like it. I hadn't thought about stressing quite this aggressively, but
> > now that I think about it, sounds like a great plan. Having some ballpark
> > measure to quantify the cost of a "backup-heavy" workload would be cool
> in
> > addition to seeing how the system reacts in unexpected manners.
> >
> > Sounds good to me.
> >>
> >> How will you test the restore aspect? After 1k (or whatever makes sense)
> >> incremental backups over the life of the chaos, could you restore and
> >> validate that the table had all expected data in place.
> >>
> >
> > Exactly. My thinking was that, at any point, we should be able to do a
> > restore and validate. Maybe something like: every Nth ITBLL iteration,
> make
> > a new backup point, restore a previous backup point, verify, restore to
> > newest backup point. The previous backup point should be a full or
> > incremental point.
> >
> > Vlad: I'm obviously curious to see what you think about this stuff, in
> > addition to what you already had in mind :)
> >
>

Re: [DISCUSS] Plan for Distributed testing of Backup and Restore

Posted by Andrew Purtell <ap...@apache.org>.
> making backup work in challenging conditions was not a goal of the FT
design; correct failure handling was the goal.

Every real-world production environment has challenging conditions.

That said, making progress in the face of failures is only one aspect of
FT, and an equally valid one is that failures do not cause data corruption.

If testing with chaos proves this backup solution will fail if there is any
failure while backup is in progress, but at least it will successfully
clean up and not corrupt existing state - that could be ok, for some.
Possibly, us.

If testing with chaos proves this backup solution will not suffer
corruption if there is a failure *and* can still successfully complete if
there is any failure while backup is in progress - that would obviously
improve the perceived value proposition.

It would be fine to test this using hbase-it chaos facilities but with a
less aggressive policy than slowDeterministic that allows for backups to
successfully complete once in a while yet also demonstrate that when the
failures do happen things are properly cleaned up and data corruption does
not happen.
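
Concretely, the per-attempt bookkeeping I'd want the harness to record is 
something like the sketch below. It is illustrative only; the verifier 
hooks are hypothetical placeholders for whatever validation the harness 
ends up doing.

// Sketch of per-attempt classification; BackupVerifier and its methods are
// hypothetical placeholders, not an existing API.
public class BackupAttemptOutcome {

  public enum Result { SUCCEEDED, FAILED_CLEANLY, FAILED_WITH_CORRUPTION }

  /** Hypothetical verification hooks supplied by the harness. */
  public interface BackupVerifier {
    boolean noPartialBackupState() throws Exception;
    boolean priorBackupsStillRestoreAndValidate() throws Exception;
  }

  public static Result classify(boolean backupSucceeded, BackupVerifier verifier)
      throws Exception {
    if (backupSucceeded) {
      return Result.SUCCEEDED;
    }
    // A failed backup is acceptable if it cleaned up after itself: no partial
    // state left behind, and previously completed backups still restore and validate.
    boolean clean = verifier.noPartialBackupState()
        && verifier.priorBackupsStillRestoreAndValidate();
    return clean ? Result.FAILED_CLEANLY : Result.FAILED_WITH_CORRUPTION;
  }
}

A run would then be judged on the mix of SUCCEEDED and FAILED_CLEANLY 
outcomes, with any FAILED_WITH_CORRUPTION being an outright test failure.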




On Tue, Sep 12, 2017 at 11:25 AM, Vladimir Rodionov <vl...@gmail.com>
wrote:

> >> Vlad: I'm obviously curious to see what you think about this stuff, in
> addition to what you already had in mind :)
>
> Yes, I think that we need a test tool similar to ITBLL. Btw, making backup
> work in challenging conditions was not a goal of the FT design; correct
> failure handling was the goal.
>
> On Tue, Sep 12, 2017 at 9:53 AM, Josh Elser <el...@apache.org> wrote:
>
> > Thanks for the quick feedback!
> >
> > On 9/12/17 12:36 PM, Stack wrote:
> >
> >> On Tue, Sep 12, 2017 at 9:33 AM, Andrew Purtell <
> andrew.purtell@gmail.com
> >> >
> >> wrote:
> >>
> >> I think those are reasonable criteria Josh.
> >>>
> >>> What I would like to see is something like "we ran ITBLL (or custom
> >>> generator with similar correctness validation if you prefer) on a dev
> >>> cluster (5-10 nodes) for 24 hours with server killing chaos agents
> >>> active,
> >>> attempted 1,440 backups (one per minute), of which 1,000 succeeded and
> >>> 100%
> >>> of these were successfully restored and validated." This implies your
> >>> points on automation and no manual intervention. Maybe the number of
> >>> successful backups under challenging conditions will be lower. Point is
> >>> they demonstrate we can rely on it even when a cluster is partially
> >>> unhealthy, which in production is often the normal order of affairs.
> >>>
> >>>
> >>>
> > I like it. I hadn't thought about stressing quite this aggressively, but
> > now that I think about it, sounds like a great plan. Having some ballpark
> > measure to quantify the cost of a "backup-heavy" workload would be cool
> in
> > addition to seeing how the system reacts in unexpected manners.
> >
> > Sounds good to me.
> >>
> >> How will you test the restore aspect? After 1k (or whatever makes sense)
> >> incremental backups over the life of the chaos, could you restore and
> >> validate that the table had all expected data in place.
> >>
> >
> > Exactly. My thinking was that, at any point, we should be able to do a
> > restore and validate. Maybe something like: every Nth ITBLL iteration,
> make
> > a new backup point, restore a previous backup point, verify, restore to
> > newest backup point. The previous backup point should be a full or
> > incremental point.
> >
> > Vlad: I'm obviously curious to see what you think about this stuff, in
> > addition to what you already had in mind :)
> >
>



-- 
Best regards,
Andrew

Words like orphans lost among the crosstalk, meaning torn from truth's
decrepit hands
   - A23, Crosstalk

Re: [DISCUSS] Plan for Distributed testing of Backup and Restore

Posted by Vladimir Rodionov <vl...@gmail.com>.
>> Vlad: I'm obviously curious to see what you think about this stuff, in
addition to what you already had in mind :)

Yes, I think that we need a test tool similar to ITBLL. Btw, making backup
work in challenging conditions was not a goal of the FT design; correct
failure handling was the goal.

On Tue, Sep 12, 2017 at 9:53 AM, Josh Elser <el...@apache.org> wrote:

> Thanks for the quick feedback!
>
> On 9/12/17 12:36 PM, Stack wrote:
>
>> On Tue, Sep 12, 2017 at 9:33 AM, Andrew Purtell <andrew.purtell@gmail.com
>> >
>> wrote:
>>
>> I think those are reasonable criteria Josh.
>>>
>>> What I would like to see is something like "we ran ITBLL (or custom
>>> generator with similar correctness validation if you prefer) on a dev
>>> cluster (5-10 nodes) for 24 hours with server killing chaos agents
>>> active,
>>> attempted 1,440 backups (one per minute), of which 1,000 succeeded and
>>> 100%
>>> of these were successfully restored and validated." This implies your
>>> points on automation and no manual intervention. Maybe the number of
>>> successful backups under challenging conditions will be lower. Point is
>>> they demonstrate we can rely on it even when a cluster is partially
>>> unhealthy, which in production is often the normal order of affairs.
>>>
>>>
>>>
> I like it. I hadn't thought about stressing quite this aggressively, but
> now that I think about it, sounds like a great plan. Having some ballpark
> measure to quantify the cost of a "backup-heavy" workload would be cool in
> addition to seeing how the system reacts in unexpected manners.
>
> Sounds good to me.
>>
>> How will you test the restore aspect? After 1k (or whatever makes sense)
>> incremental backups over the life of the chaos, could you restore and
>> validate that the table had all expected data in place.
>>
>
> Exactly. My thinking was that, at any point, we should be able to do a
> restore and validate. Maybe something like: every Nth ITBLL iteration, make
> a new backup point, restore a previous backup point, verify, restore to
> newest backup point. The previous backup point should be a full or
> incremental point.
>
> Vlad: I'm obviously curious to see what you think about this stuff, in
> addition to what you already had in mind :)
>

Re: [DISCUSS] Plan for Distributed testing of Backup and Restore

Posted by Josh Elser <el...@apache.org>.
Thanks for the quick feedback!

On 9/12/17 12:36 PM, Stack wrote:
> On Tue, Sep 12, 2017 at 9:33 AM, Andrew Purtell <an...@gmail.com>
> wrote:
> 
>> I think those are reasonable criteria Josh.
>>
>> What I would like to see is something like "we ran ITBLL (or custom
>> generator with similar correctness validation if you prefer) on a dev
>> cluster (5-10 nodes) for 24 hours with server killing chaos agents active,
>> attempted 1,440 backups (one per minute), of which 1,000 succeeded and 100%
>> of these were successfully restored and validated." This implies your
>> points on automation and no manual intervention. Maybe the number of
>> successful backups under challenging conditions will be lower. Point is
>> they demonstrate we can rely on it even when a cluster is partially
>> unhealthy, which in production is often the normal order of affairs.
>>
>>

I like it. I hadn't thought about stressing quite this aggressively, but 
now that I think about it, sounds like a great plan. Having some 
ballpark measure to quantify the cost of a "backup-heavy" workload would 
be cool in addition to seeing how the system reacts in unexpected manners.

> Sounds good to me.
> 
> How will you test the restore aspect? After 1k (or whatever makes sense)
> incremental backups over the life of the chaos, could you restore and
> validate that the table had all expected data in place.

Exactly. My thinking was that, at any point, we should be able to do a 
restore and validate. Maybe something like: every Nth ITBLL iteration, 
make a new backup point, restore a previous backup point, verify, 
restore to newest backup point. The previous backup point should be a 
full or incremental point.
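
Spelled out as a sketch (the hooks below are hypothetical placeholders 
for whatever harness we end up building):

// Illustrative only; takeBackupPoint/restore/verifyAgainstExpectedData are
// hypothetical hooks. At every Nth ITBLL iteration we add a backup point,
// prove an older point still restores and validates, then roll forward to
// the newest point before the load continues.
public abstract class PeriodicRestoreCheck {

  protected abstract String takeBackupPoint(boolean full) throws Exception;
  protected abstract void restore(String backupId) throws Exception;
  protected abstract void verifyAgainstExpectedData(String backupId) throws Exception;

  private final java.util.List<String> points = new java.util.ArrayList<>();

  public void onNthIteration() throws Exception {
    String newest = takeBackupPoint(points.isEmpty());   // full first, then incrementals
    points.add(newest);
    if (points.size() > 1) {
      String earlier = points.get(points.size() - 2);    // any earlier full/incremental point
      restore(earlier);
      verifyAgainstExpectedData(earlier);
      restore(newest);                                   // back to the latest state before
    }                                                    // the next ITBLL iteration
  }
}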

Vlad: I'm obviously curious to see what you think about this stuff, in 
addition to what you already had in mind :)

Re: [DISCUSS] Plan for Distributed testing of Backup and Restore

Posted by Stack <st...@duboce.net>.
On Tue, Sep 12, 2017 at 9:33 AM, Andrew Purtell <an...@gmail.com>
wrote:

> I think those are reasonable criteria Josh.
>
> What I would like to see is something like "we ran ITBLL (or custom
> generator with similar correctness validation if you prefer) on a dev
> cluster (5-10 nodes) for 24 hours with server killing chaos agents active,
> attempted 1,440 backups (one per minute), of which 1,000 succeeded and 100%
> of these were successfully restored and validated." This implies your
> points on automation and no manual intervention. Maybe the number of
> successful backups under challenging conditions will be lower. Point is
> they demonstrate we can rely on it even when a cluster is partially
> unhealthy, which in production is often the normal order of affairs.
>
>
Sounds good to me.

How will you test the restore aspect? After 1k (or whatever makes sense)
incremental backups over the life of the chaos, could you restore and
validate that the table had all expected data in place.

Thanks,
St.Ack



>
> > On Sep 12, 2017, at 9:07 AM, Josh Elser <el...@apache.org> wrote:
> >
> >> On 9/11/17 11:52 PM, Stack wrote:
> >> On Mon, Sep 11, 2017 at 11:07 AM, Vladimir Rodionov <
> vladrodionov@gmail.com>
> >> wrote:
> >>> ...
> >>> That is mostly it. Yes, we have not done real testing with real data on a
> >>> real cluster yet, except QA testing on a small OpenStack
> >>> cluster (10 nodes). That is probably our biggest minus right now. I
> >>> would like to inform the community that this week we are going to start
> >>> full scale testing with reasonably sized data sets.
> >>>
> >> ... Completion of HA seems important as is result of the scale testing.
> >
> > I think we should knock out a rough sketch on what effective "scale"
> testing would look like since that is a very subjective phrase. Let me
> start the ball rolling with a few things that come to my mind.
> >
> > (interpreting requirements as per rfc2119)
> >
> > * MUST have >5 RegionServers and >1 Masters in play
> > * MUST have Non-trivial final data sizes (final data size would be >=
> 100's of GB)
> > * MUST have some clear pass/fail determination for correctness of B&R
> > * MUST have some fault-injection
> >
> > * SHOULD be a completely automated test, not requiring a human to
> coordinate or execute commands.
> > * SHOULD be able to acquire operational insight (metrics) while
> performing operations to determine success of testing
> > * SHOULD NOT require manual intervention, e.g. working around known
> issues/limitations
> > * SHOULD reuse the IntegrationTest framework in hbase-it
> >
> > Since we have a concern of correctness, ITBLL sounds like a good
> starting point to avoid having to re-write similar kinds of logic.
> ChaosMonkey is always great for fault-injection.
> >
> > Thoughts?
>

Re: [DISCUSS] Plan for Distributed testing of Backup and Restore

Posted by Andrew Purtell <an...@gmail.com>.
I think those are reasonable criteria Josh. 

What I would like to see is something like "we ran ITBLL (or custom generator with similar correctness validation if you prefer) on a dev cluster (5-10 nodes) for 24 hours with server killing chaos agents active, attempted 1,440 backups (one per minute), of which 1,000 succeeded and 100% of these were successfully restored and validated." This implies your points on automation and no manual intervention. Maybe the number of successful backups under challenging conditions will be lower. Point is they demonstrate we can rely on it even when a cluster is partially unhealthy, which in production is often the normal order of affairs. 
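
In harness terms I'm picturing something like the sketch below. It is 
illustrative only; attemptBackup() and restoreAndValidate() are 
hypothetical stand-ins for whatever the harness actually calls.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Sketch of the cadence and accounting only; the two abstract hooks are
// hypothetical stand-ins for the real backup and verification calls.
public abstract class BackupUnderChaosDriver {

  protected abstract String attemptBackup() throws Exception;          // returns a backup id
  protected abstract boolean restoreAndValidate(String backupId) throws Exception;

  public void run(int attempts) throws Exception {
    List<String> succeeded = new ArrayList<>();
    int failed = 0;
    for (int i = 0; i < attempts; i++) {                 // e.g. 1,440 = one per minute for 24h
      long start = System.nanoTime();
      try {
        succeeded.add(attemptBackup());
      } catch (Exception e) {
        failed++;                                        // chaos is active, failures are expected
      }
      long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
      Thread.sleep(Math.max(0, TimeUnit.MINUTES.toMillis(1) - elapsedMs));
    }
    int validated = 0;
    for (String id : succeeded) {                        // expect 100% of these to restore cleanly
      if (restoreAndValidate(id)) {
        validated++;
      }
    }
    System.out.printf("attempted=%d succeeded=%d failed=%d restoredAndValidated=%d%n",
        attempts, succeeded.size(), failed, validated);
  }
}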


> On Sep 12, 2017, at 9:07 AM, Josh Elser <el...@apache.org> wrote:
> 
>> On 9/11/17 11:52 PM, Stack wrote:
>> On Mon, Sep 11, 2017 at 11:07 AM, Vladimir Rodionov <vl...@gmail.com>
>> wrote:
>>> ...
>>> That is mostly it. Yes, we have not done real testing with real data on a
>>> real cluster yet, except QA testing on a small OpenStack
>>> cluster (10 nodes). That is probably our biggest minus right now. I
>>> would like to inform the community that this week we are going to start
>>> full scale testing with reasonably sized data sets.
>>> 
>> ... Completion of HA seems important as is result of the scale testing.
> 
> I think we should knock out a rough sketch on what effective "scale" testing would look like since that is a very subjective phrase. Let me start the ball rolling with a few things that come to my mind.
> 
> (interpreting requirements as per rfc2119)
> 
> * MUST have >5 RegionServers and >1 Masters in play
> * MUST have Non-trivial final data sizes (final data size would be >= 100's of GB)
> * MUST have some clear pass/fail determination for correctness of B&R
> * MUST have some fault-injection
> 
> * SHOULD be a completely automated test, not requiring a human to coordinate or execute commands.
> * SHOULD be able to acquire operational insight (metrics) while performing operations to determine success of testing
> * SHOULD NOT require manual intervention, e.g. working around known issues/limitations
> * SHOULD reuse the IntegrationTest framework in hbase-it
> 
> Since we have a concern of correctness, ITBLL sounds like a good starting point to avoid having to re-write similar kinds of logic. ChaosMonkey is always great for fault-injection.
> 
> Thoughts?