Posted to dev@solr.apache.org by Jason Gerlowski <ge...@gmail.com> on 2021/07/30 19:15:00 UTC

Fixing Inefficient Solr Operator Backups

Hey all,

I've been getting familiar in the last week or two with our new
operator, and noticed that the way its backups work will miss out on
the "incremental" efficiency improvements added recently as a part of
SIP-12.  For backups to be done incrementally, an ongoing backup has
to be able to "see" the files stored by previous backups so that it
knows which index files to skip over.  Our current operator support
does a few things that prevent this in practice:

- the operator "rm -rf"s all files at the backup location before
starting each new backup
- the operator requests each backup at a unique name/location
- the operator compresses the backup file tree after finishing each backup

Everything will still work; the backups just won't be nearly as
efficient for many common use cases as they could be.
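
To make the "see the previous files" requirement concrete, here is a toy
model in Python. It is only an illustration, not how Solr implements
incremental backups: `files_to_copy` and the segment file names are
hypothetical. It shows why a shared location that is left untouched lets
a later backup skip files, while a wiped or unique location does not.

```python
def files_to_copy(index_files, already_stored):
    """Return the index files a backup still has to copy.

    index_files: names of the files in the current index.
    already_stored: names of files earlier backups left at the location.
    """
    return sorted(set(index_files) - set(already_stored))


# First backup: nothing is stored yet, so every file is copied.
first = files_to_copy(["seg_1", "seg_2"], already_stored=[])

# Second backup into the same, untouched location: only the new file is copied.
second = files_to_copy(["seg_1", "seg_2", "seg_3"], already_stored=first)

# Second backup after the location was wiped (or a unique location was
# requested): nothing can be skipped, so everything is copied again.
wiped = files_to_copy(["seg_1", "seg_2", "seg_3"], already_stored=[])
```

In this model, each of the three operator behaviors listed above has the
same effect as passing an empty `already_stored` every time.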

There are a few ways we could address this.

In one approach, we could leave 'solrbackup' mostly untouched. For
"incremental" situations, we would create a new resource-type
('solrbackupschedule'? 'solrbackuprepeating'?) that's explicitly
geared towards repeated backups of the same collections and knows to
store these all in the same location.  Conceivably it could also have
other useful ops features like cron-job-like scheduling of backups.
'solrbackupschedule' would then be our solution for users who want to
do recurring or repeated backups, and 'solrbackup' could be
repositioned in the docs as the solution for those doing an ad-hoc,
standalone backup.

Another approach would be to focus instead on adding configuration
options to 'solrbackup' that would make it suitable for incremental
backups: enable/disable backup compression, cleaning/retaining the
"location" prior to doing a backup, an override for the backup
location, etc.  'solrbackup' would remain the option for anyone doing
any sort of backup.  (Of course, we could also add a
solrbackupschedule resource-type as a layer on top of this if the idea
of cron-like backup triggering is appealing, but it could be
implemented in terms of managing 'solrbackup' sub-resources that
perform the actual "work".)

There are tradeoffs for both approaches IMO.

The first approach is simplest in terms of backcompat.  It may also
prove simplest in handling discrepancies between Solr versions
(incremental backups only supported in v8.9+).  But it leaves a
potential usecase gap: users may take backups frequently enough to
benefit from "incrementality", but without any sort of defined
schedule or set periodicity like a 'solrbackupschedule' resource might
require.  It also risks duplicating code as both 'solrbackup' and
'solrbackupschedule' would involve similar actions.
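
As a rough sketch of the version discrepancy mentioned above, a check
like the following could gate incremental behavior. The function name
`supports_incremental` is hypothetical, and the sketch assumes the
version is available as a "major.minor[.patch]" string; how the operator
would actually learn the Solr version is not covered here.

```python
def supports_incremental(solr_version):
    """Return True if the version string is 8.9 or later.

    Assumes a "major.minor[.patch]" string; major and minor are compared
    numerically so that "8.10" sorts after "8.9".
    """
    major, minor = (int(part) for part in solr_version.split(".")[:2])
    return (major, minor) >= (8, 9)
```

Comparing integer tuples rather than raw strings avoids treating "8.10"
as older than "8.9".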

OTOH, the second approach is more flexible ('solrbackup' would become
suitable for any common backup usecase), and 'solrbackupschedule', if
created, has a really nice conceptual separation being implemented as
a level on top of 'solrbackup'.  But it pays for this all by making
'solrbackup' more complex and harder for a non-Solr-SME to "get right"
out of the box and opening some backcompat questions/challenges.
Lastly, it'd require us to think carefully about how cleanup and
resource-deletion work, since this approach will allow multiple
'solrbackup' resources to share a backup "location".

Anyone have any thoughts or preferences between those two options?  Or
some third approach I missed?  Or even general context around why our
operator backup support looks the way it does?  Really appreciate any
input!

Best,

Jason

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@solr.apache.org
For additional commands, e-mail: dev-help@solr.apache.org


Re: Fixing Inefficient Solr Operator Backups

Posted by Jason Gerlowski <ge...@gmail.com>.
> I'm a big fan of the second approach ... I think it's fine to make those assumptions.

I was leaning towards approach (2) as well, so that all sounds good to me.

I'd have to think a little more about what defaults make most sense
for the non-incremental use case before I could weigh in intelligently
there.  I think it probably ties into what the default will be for the
"incremental" option that you suggested.  If the default is
incremental=true, then I think it's safe to assume that someone
choosing non-incremental is fine blowing away any existing files,
doing compression at the end, etc.  But if non-incremental is the
quiet default, I'm less sure.

In any case, thanks for responding - having a general sanity check on
the approach gives me enough to get started!

Best,

Jason

On Mon, Aug 2, 2021 at 12:11 PM Houston Putman <ho...@gmail.com> wrote:
>
> Hey Jason, thanks for the thorough investigation here.
>
> I'm a big fan of the second approach, but in this case I think we'd really only need 1 option: incremental: true/false
>
> If the user specifies an incremental backup, we know that:
>
> - They do not want a unique name
> - The data already there should not be deleted
> - The data should not be compressed
>
> I think it's fine to make those assumptions.
>
> However for the non-incremental use case, some of those options do come into play.
>
> - I think deleting the existing data is fine, but please correct me if I'm wrong
> - Compressing data by default should be fine? I see no reason not to, but we can always make this an option
> - The unique name thing is fair, but if we do enable cron-scheduled backups, then we probably do want a unique name per-backup here.
>
> I think it's fine to change the default behavior going forward if it comes with a good reason, but for the incremental/non-incremental option
> I think a field in the CRD is by far the best option.
>
> - Houston


Re: Fixing Inefficient Solr Operator Backups

Posted by Houston Putman <ho...@gmail.com>.
Hey Jason, thanks for the thorough investigation here.

I'm a big fan of the second approach, but in this case I think we'd really
only need 1 option: *incremental: true/false*

If the user specifies an incremental backup, we know that:

   - They do not want a unique name
   - The data already there should not be deleted
   - The data should not be compressed

I think it's fine to make those assumptions.
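
A minimal sketch of that idea in Python, assuming a single boolean drives
the three behaviors. `BackupBehavior` and `behavior_for` are hypothetical
names for illustration, not part of the operator or its CRD. The
non-incremental values simply mirror what the operator is described as
doing today and are still open for discussion.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupBehavior:
    unique_name: bool     # request a fresh name/location for each backup
    clean_location: bool  # delete existing files before starting
    compress: bool        # compress the backup file tree afterwards


def behavior_for(incremental):
    """Derive the three backup behaviors from one incremental flag."""
    if incremental:
        # Incremental backups must reuse the location and leave it readable.
        return BackupBehavior(unique_name=False, clean_location=False, compress=False)
    # Assumption: non-incremental keeps the current operator behavior.
    return BackupBehavior(unique_name=True, clean_location=True, compress=True)
```

Keeping the derivation in one place means users only have to get a
single field right, which is the appeal of a lone `incremental` option.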

However for the non-incremental use case, some of those options do come
into play.

   - I think deleting the existing data is fine, but please correct me if
   I'm wrong
   - Compressing data by default should be fine? I see no reason not to,
   but we can always make this an option
   - The unique name thing is fair, but if we do enable cron-scheduled
   backups, then we probably do want a unique name per-backup here.

I think it's fine to change the default behavior going forward if it comes
with a good reason, but for the incremental/non-incremental option
I think a field in the CRD is by far the best option.

- Houston
