Posted to dev@directory.apache.org by Emmanuel Lecharny <el...@gmail.com> on 2008/11/20 15:27:40 UTC

Backend recovery

Hi guys,

as I'm currently working on a recovery tool, I have reviewed the current 
configuration we have. Let's first analyse what a recovery tool is good 
for and how it should work.

1) Usage
We need to rebuild the indices if the database is corrupted. This can 
occur on many occasions, mainly when the server is brutally interrupted 
(power shutdown, for instance). As indices are dynamically built while 
injecting new entries (or when updating those entries), it is important 
to have such a tool to rebuild them when the indices are not up to 
date, as otherwise we may have orphan entries or, even worse, indices 
pointing to non-existing entries or to the wrong entries.

Another usage for such a tool would be to create the indices offline. This 
has the great advantage of allowing a mass injection of entries into the 
master table (the table containing all the entries), followed by a global 
re-indexing, potentially avoiding a lot of expensive controls, as we may 
run a pre-check on the data before doing the injection. (Of course, by 
tuning the number of checks we do, we can go faster or slower, 
depending on how much we trust the data we inject into the server.)
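To make the idea concrete, here is a minimal sketch of such an offline 
re-indexing pass. MasterTable, Index and Entry are hypothetical stand-ins 
for the real partition classes, not the current ADS API:

import java.util.Iterator;
import java.util.List;

public class OfflineReindexer {

    public void reindex(MasterTable master, List<Index> indices) throws Exception {
        // Start from empty indices: the master table is the single
        // source of truth for the entries.
        for (Index index : indices) {
            index.clear();
        }

        // Walk the master table once and feed every entry to every index.
        Iterator<Entry> entries = master.iterator();

        while (entries.hasNext()) {
            Entry entry = entries.next();

            for (Index index : indices) {
                index.add(entry);
            }
        }

        // Flush the rebuilt indices to disk before the server uses them again.
        for (Index index : indices) {
            index.sync();
        }
    }
}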

2) How it should work
This is where we have a real problem. Assuming that the backend storage 
can be totally corrupted, we can't really trust it to rebuild the indices. 
What are our options ? Let's see how the system works when we modify some 
entries :
- the database is in state S0 at t0 (let's say when we started the 
server for the very first time)
- each added entry changes the current state. As we may have deferred 
writes, this state is written to disk only every N seconds (unless 
deferred writes are disabled)
- at some point, we are in state Si, and we have some more modifications
- then we have a crash. It can occur before the current modifications 
are written to disk, in which case we have lost every modification made 
since state Si (a), or, worse, it occurs while we are writing those 
modifications, and now we can have a totally unstable base (b).

Case (a) can be handled, as the base is not corrupted. The downside is 
that we have lost some data, which may be a problem. However, if we 
don't use deferred writes, we can avoid such a case (except for the last 
modification), with the major inconvenience that we are now more likely 
to fall into case (b)...

Case (b) is more problematic, because we have no way to determine 
which was the previous stable state (Si), nor to restore this state, as 
the base has already been partially modified.

What is the solution ? We have to assume that the state Si can be 
restored, and that we can apply every modification to it. The only way 
to do that is to combine two techniques :
- back up the base on a regular basis, assuming that it can be done without 
allowing any update on this base during the backup (and this is not 
obvious, as the base can be pretty big)
- and store every modification in a journal, to be able to replay them 
on the restored base.

(keep in mind that we are not using a transactional system).

3) How does this translate for ADS ?

ADS has a ChangeLog interceptor which logs all the modified entries as a 
list of modifications. Each change has a unique revision (to be fixed, 
as we are not using a synchronized counter atm), and can be stored in 
a sequential text file, each modification being stored as an LDIF change 
operation (we currently don't have a file-based storage).
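As a rough illustration of what such a file-based ChangeLog storage could 
look like (ChangeLogEvent and its toLdif() method are hypothetical 
stand-ins, not the current ADS classes) :

import java.io.FileWriter;
import java.io.IOException;
import java.io.Writer;
import java.util.concurrent.atomic.AtomicLong;

public class LdifJournal {
    private final AtomicLong revision = new AtomicLong(0);
    private final Writer out;

    public LdifJournal(String fileName) throws IOException {
        // Append-only text file : logged changes are never rewritten.
        out = new FileWriter(fileName, true);
    }

    public synchronized long log(ChangeLogEvent event) throws IOException {
        long rev = revision.incrementAndGet();
        out.write("# revision: " + rev + "\n");
        out.write(event.toLdif());    // the change as an LDIF changetype record
        out.write("\n");
        out.flush();
        return rev;
    }
}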

The idea would be to add some way to save the underlying files, and then 
to apply all the stored logs on these files. We could even replay the 
whole journal from day one, but this would be total overkill. A 
third option, IMHO way better, as it eliminates the need to do a backup, 
would be to apply the log on a separate base, so that we don't need to 
do backups on the fly (of course, before applying the current logs, we 
should back up the spare files, and when the logs have been applied, ditch 
the N-1 backup). Here is the algorithm (a code sketch follows the steps) :

Journal last position is N, we have n modifications since then
Spare current base is version N

- time to apply the journal ! Mark the current position in the journal 
as N+1
- copy the spare base (Spare-N) to Spare-(N+1)
- apply the journal from position N to position N+1
- if everything is ok, tell the system that the current backup base is 
Spare-(N+1) and that the current log position is N+1
- now, we can ditch the Spare-N base

Journal last position is N+1 now, and we may have m modifications since 
then, as the server continues to log
Spare current base is version N+1
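
A minimal sketch of one pass of that algorithm, assuming hypothetical 
Journal, SpareBase and CheckpointStore helpers (none of these exist yet; 
the real code would sit on top of the partition files) :

public class JournalApplier {

    public void applyJournal(Journal journal, CheckpointStore checkpoint) throws Exception {
        long n = checkpoint.lastAppliedPosition();    // journal position N
        long next = journal.currentPosition();        // mark the new position N+1

        // Copy Spare-N to Spare-(N+1) so the known good copy is never touched.
        SpareBase spareN = checkpoint.currentSpare();
        SpareBase spareNext = spareN.copyTo("spare-" + next);

        // Replay every change logged between N and N+1 on the copy.
        for (LdifChange change : journal.changesBetween(n, next)) {
            spareNext.apply(change);
        }

        // Only once the replay has succeeded do we switch the pointers.
        // This is the step that should be as close to atomic as possible.
        checkpoint.commit(spareNext, next);

        // The old copy is no longer needed.
        spareN.delete();
    }
}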

In order to get this working in the server, we have a few things to do :
- decouple the Log sync from the partition sync (currently both are 
written at the same time)
- extend the ChangeLog to write in a flat file, injecting LDIF into it
- add a thread to implement the previous algorithm
- add a handler to run the previous algorithm if the server has crashed, 
when the server is restarted (a rough sketch of this handler follows)
- add a command line option to run the algorithm offline
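
For the restart handler, one possible (purely hypothetical) approach would 
be a "clean shutdown" marker file checked at startup; the JournalApplier, 
Journal and CheckpointStore types are the same made-up helpers as above :

import java.io.File;

public class RecoveryHandler {

    private static final String MARKER = "server.clean-shutdown";

    public void onStartup(File workDir, JournalApplier applier,
                          Journal journal, CheckpointStore checkpoint) throws Exception {
        File marker = new File(workDir, MARKER);

        if (!marker.exists()) {
            // The previous run did not shut down cleanly : ignore the possibly
            // corrupted partition files and rebuild from the last spare base
            // plus the journal.
            applier.applyJournal(journal, checkpoint);
        }

        // Remove the marker ; it is recreated only on a clean shutdown.
        marker.delete();
    }
}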

A second advantage would be to allow a bulk load without all the time 
consuming controls we have in the server.

Thoughts ?

-- 
--
cordialement, regards,
Emmanuel Lécharny
www.iktek.com
directory.apache.org



Re: Backend recovery

Posted by Emmanuel Lecharny <el...@gmail.com>.
Hi Stefan,

>> Case (b) is more problematic, because we have no way to determine
>> which was the previous stable state (Si), nor to restore this state, as
>> the base has already been partially modified.
> Does JDBM provide tools to recover the latest stable state?

AFAIK, only if you are using the transactional mode. And as there is
no documentation ... :/

<snip/>
>> - and store every modification into a journal, to be able to replay them
>> on the restored base.
> Do I understand right that you want to write and sync the journal (the
> changelog LDIF) immediately but the partition only from time to time?

yes.

> I think that only makes sense if writing the journal is cheaper and faster
> than writing the partition data.

It should be way faster, as we just add the new entry at the end of
the file, without creating any index, nor modifying the master table.
We also don't have to manage a cache. However, we need to have a kind
of synchronization : each entry must be stored in a concurrent queue
before being written to disk. We also have to flush the entries to
the disk for real, otherwise the underlying OS may defer the write
too.
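
Something along these lines (just a sketch of the queue-and-sync 
mechanics, the names are made up) :

import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class JournalWriter implements Runnable {
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<String>();
    private final FileChannel channel;

    public JournalWriter(String fileName) throws Exception {
        channel = new RandomAccessFile(fileName, "rw").getChannel();
        channel.position(channel.size());    // append at the end of the journal
    }

    // Called by the modify operations : just enqueue the LDIF record.
    public void log(String ldifRecord) {
        queue.offer(ldifRecord);
    }

    // Single writer thread : drain the queue, write and force to disk so
    // that the OS can't defer the write either.
    public void run() {
        try {
            while (true) {
                String record = queue.take();
                channel.write(ByteBuffer.wrap(record.getBytes("UTF-8")));
                channel.force(false);
            }
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
        } catch (Exception e) {
            // A real implementation would surface the failure to the caller.
        }
    }
}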

>
> Hm, that is similar to journaling file systems, right?

yep.

>
>>
>> the N-1 backup). Here is the algorithm :
>>
>> Journal last position is N, we have n modifications since then
>> Spare current base is version N
>>
>> - time to apply the journal ! Mark the current position in the journal
>> as N+1
>> - copy the spare base (Spare-N) to Spare-(N+1)
>> - apply the journal from position N to position N+1
>> - if everything is ok, tell the system that the current backup base is
>> Spare-(N+1) and that the current log position is N+1
> Ideally this should be an atomic operation.

Absolutely, but I see no way to do that. At best, it will be a kind of
two phase commit. The most important point is to have a coherent
system even if there is a crash at the worst moment.

Thanks !
-- 
Regards,
Cordialement,
Emmanuel Lécharny
www.iktek.com

Re: Backend recovery

Posted by Stefan Seelmann <se...@apache.org>.
Hi Emmanuel

Emmanuel Lecharny wrote:
> Hi guys,
> 
> as I'm currently working on a recovery tool, I have reviewed the current
> configuration we have. Let's first analyse what a recovery tool is good
> for and how it should work.
> 
> 1) Usage
> ...
Makes sense.

> 2) How it should work
> ...
> (a), or, worse, it occurs while we are writing those modifications, and
> now, we can have a totally unstable base (b).
> ...
> Case (b) is more problematic, because we have no way to determine
> which was the previous stable state (Si), nor to restore this state, as
> the base has already been partially modified.
Does JDBM provide tools to recover the latest stable state?

> 
> What is the solution ? We have to assume that the state Si can be
> restored, and that we can apply every modification on it. The only way
> to do that is to combine two techniques :
> - backup the base on regular basis, assuming that it can be done without
> allowing any update on this base during the backup (and this is not
> obvious, as the base can be pretty big)
> - and store every modification into a journal, to be able to replay them
> on the restored base.
Do I understand right that you want to write and sync the journal (the
changelog LDIF) immediately but the partition only from time to time? I
think that only makes sense if writing the journal is cheaper and faster
than writing the partition data.

Hm, that is similar to journaling file systems, right?

> 
> the N-1 backup). Here is the algorithm :
> 
> Journal last position is N, we have n modifications since then
> Spare current base is version N
> 
> - time to apply the journal ! Mark the current position in the journal
> as N+1
> - copy the spare base (Spare-N) to Spare-(N+1)
> - apply the journal from position N to position N+1
> - if everything is ok, tell the system that the current backup base is
> Spare-(N+1) and that the current log position is N+1
Ideally this should be an atomic operation.

Sounds good.

Regards,
Stefan