Posted to dev@directory.apache.org by Kiran Ayyagari <ay...@gmail.com> on 2009/02/01 19:49:23 UTC

[DRS] thoughts about implementation

Hello guys,

     Here is an initial idea about the implementation that I have in mind.

     HOWL has a feature called 'marking' in its log file (a.k.a. journal). The idea is to use this as a checkpoint for
     the last successful disk write of the DIT data, i.e. whenever we perform a sync we put a mark in the journal.
     In case of a crash we can retrieve the data from the journal starting at the marked position (using the HOWL API).
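
     A rough sketch of how this could look (the Logger put()/mark() calls are what I recall from the HOWL
     javadocs, so treat the exact signatures as unverified, and the wrapper class itself is only illustrative):

     import org.objectweb.howl.log.Configuration;
     import org.objectweb.howl.log.Logger;

     // illustrative wrapper around a HOWL journal, not existing ApacheDS code
     public class Journal
     {
         private final Logger logger;
         private volatile long lastKey;   // key returned for the latest journal record

         public Journal( Configuration cfg ) throws Exception
         {
             logger = new Logger( cfg );
             logger.open();
         }

         // called for every add/del/mod as soon as the operation is done
         public void log( byte[] serializedOp ) throws Exception
         {
             lastKey = logger.put( serializedOp, true );   // force the record to disk
         }

         // called once the DIT data has been flushed to the JDBM files
         public void checkpoint() throws Exception
         {
             logger.mark( lastKey );   // records up to lastKey are now safe on disk
         }
     }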

     Currently, syncing the DIT data is a per-partition operation unless a call is made to
     DirectoryService's sync() method, which internally calls sync() on the PartitionNexus.

     IMO this marking of the journal should happen in DirectoryService's sync() operation.

     Changing the partition's sync() method to call DirectoryService's sync(), which in turn calls each partition's
     commit() (a new method), would help. A special flag like 'isDirty' in the partition would let us avoid calling
     commit() on partitions that have no pending changes, as sketched below.
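
     Roughly like this (every method name here, commit(), isDirty(), checkpoint(), is hypothetical, nothing that
     exists today; the Journal class is the HOWL wrapper sketched above):

     // hypothetical sketch of the proposed flush-then-mark flow
     public interface Partition
     {
         boolean isDirty();               // proposed flag: are there un-flushed changes?
         void commit() throws Exception;  // proposed method: flush this partition's DIT data
     }

     public class DirectoryServiceSketch
     {
         private final java.util.List<Partition> partitions = new java.util.ArrayList<Partition>();
         private final Journal journal;   // see the HOWL sketch above

         public DirectoryServiceSketch( Journal journal )
         {
             this.journal = journal;
         }

         public void sync() throws Exception
         {
             for ( Partition partition : partitions )   // every partition behind the nexus
             {
                 if ( partition.isDirty() )             // skip partitions with nothing pending
                 {
                     partition.commit();
                 }
             }

             journal.checkpoint();                      // one mark after all partitions are flushed
         }
     }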

     Any other ideas about how best we can maintain a checkpoint/mark after *any* sync operation on DIT data?

     Having said this, I have another issue: how do I detect the beginning of a corrupted
     entry in a JDBM file (all the DIT data is stored in these files)?

     To put this another way: if a JDBM file was synced at the nth entry and the server crashed in the middle of
     writing the (n+1)th entry, I would like to start recovery from the end of the nth record (a common idea, I believe).
     (I haven't looked at the JDBM code yet, but I'm throwing this question out anticipating a quick answer ;) )

     One idea that comes to mind as I write this is to read each entry from the beginning of the JDBM file using the
     serializer classes we provide for storing the (K, V) pairs, along the lines of the sketch below.
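
     Something like the following could tell us where the readable records end. It assumes the standard jdbm
     RecordManager/BTree/TupleBrowser API, and the b-tree name is just an example; also note it walks the
     b-tree in key order, not raw file order, which is the closest we can get without going below the jdbm API:

     import java.io.IOException;

     import jdbm.RecordManager;
     import jdbm.RecordManagerFactory;
     import jdbm.btree.BTree;
     import jdbm.helper.Tuple;
     import jdbm.helper.TupleBrowser;

     public class MasterTableScan
     {
         // walk the master table and count how many records still deserialize cleanly
         public static long countReadableRecords( String dbFile, String btreeName ) throws IOException
         {
             RecordManager recMan = RecordManagerFactory.createRecordManager( dbFile );
             BTree master = BTree.load( recMan, recMan.getNamedObject( btreeName ) );

             Tuple tuple = new Tuple();
             TupleBrowser browser = master.browse();
             long readable = 0;

             try
             {
                 while ( browser.getNext( tuple ) )   // getNext() runs our (K, V) serializers
                 {
                     readable++;
                 }
             }
             catch ( Exception e )
             {
                 // deserialization failed here: everything from this point on is suspect
             }

             recMan.close();
             return readable;
         }
     }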

     thoughts?

-- 
Kiran Ayyagari

Re: [DRS] thoughts about implementation

Posted by ayyagarikiran <ay...@gmail.com>.

Emmanuel Lecharny wrote:
> Hi,
> 
>>    IMO this marking of journal should happen in the DirectoryService's
>> sync() operation.
> 
> As the sync can occur over quite a long interval of time (let's say 15 secs
> by default), if we depend on it to add a checkpoint, that means we may
> lose 15 secs of modifications. It may be seen as acceptable, but
> IMHO, the idea is to store logs as fast as possible, and when a
> modification is considered done, then add a checkpoint. This will
> limit the potential loss of information.
> 

Sorry if my initial description gave a different impression. The data *will* be written to the journal as soon as an
operation (add/del/mod) is done; the 'mark' operation just puts a checkpoint in the journal to tell us that the DIT
data present up to this checkpoint has been flushed to the JDBM files on disk. So the journal *always* contains the
data, irrespective of the state of the JDBM file(s).

But I want to put the 'mark' only when the JDBM files are synced, so that we can figure out which records weren't
flushed to the JDBM file(s).
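
To make the recovery side concrete, this is roughly what I picture for replaying the un-flushed records after a
crash (again assuming HOWL's ReplayListener/replay() API as I remember it from the javadocs; the buffer size and
the re-apply logic are placeholders):

    import org.objectweb.howl.log.LogException;
    import org.objectweb.howl.log.LogRecord;
    import org.objectweb.howl.log.Logger;
    import org.objectweb.howl.log.ReplayListener;

    // illustrative only: re-apply every journal record written after the active mark
    public class CrashRecovery implements ReplayListener
    {
        public void onRecord( LogRecord record )
        {
            // the payload is whatever we serialized when journaling the operation;
            // deserialize it and re-apply it to the partition(s)
        }

        public void onError( LogException e )
        {
            // the journal itself is damaged at this point; stop and report it
        }

        public LogRecord getLogRecord()
        {
            return new LogRecord( 1024 );   // buffer size is an arbitrary example
        }

        public static void recover( Logger journal ) throws Exception
        {
            // as I understand it, replay() starts at the active mark, i.e. exactly
            // the records that were never flushed to the JDBM files
            journal.replay( new CrashRecovery() );
        }
    }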

>>    A change to the partition's sync() method to call DirectoryService's
>> sync() which in turn calls (each) partition's
>>    commit() (a new method) would help. A special flag like 'isDirty' in the
>> partition will allow us to avoid calling
>>    commit() on each partition.
>>
>>    Any other ideas about how best we can maintain a checkpoint/mark after
>> *any* sync operation on DIT data?.
> 
> When the mod is considered done (even if not written into the
> backend), we should consider the operation valid, then write it to
> the log and, when done, add a checkpoint. That should be done for
> every modification, whatever it costs.
> 
yep this is the case

> IMO, JDBM should be used to help access the modification operations,
> but the master table should not contain the real data, just a
> pointer to another file in which modifications are written in a
> sequential way. We just keep an offset in the MasterTable. If
> something goes really wrong, we can rebuild the master table and all
> the indexes from this sequential file.
> 
Yeah, it would be great if we can avoid having the complete data in a sequential file.
And BTW, the HOWL journal file holds some data in clear-text format, which might be a security concern
(yes, we do have cryptography, but that's a different thing altogether ;) )

> This can be discussed, I'm just speaking my mind here, not imposing
> any solution.
> 

thanks Emmanuel

Kiran Ayyagari

Re: [DRS] thoughts about implementation

Posted by Emmanuel Lecharny <el...@apache.org>.
Hi,

On Sun, Feb 1, 2009 at 7:49 PM, Kiran Ayyagari <ay...@gmail.com> wrote:
> Hello guys,
>
>    Here is an initial idea about implementation which I have in my mind
>
>    HOWL has a feature called 'marking' in its log file (a.k.a journal). The
> idea is to use this as a checkpoint since
>    the last successful disk write of the DIT data i.e whenever we perform a
> sync we put a mark in the journal.
>    in case of a crash we can retrieve the data from journal from the marked
> position(using howl API),

This is the only way we can restore non-corrupted data: everything
before a checkpoint is correct, everything after a checkpoint is
assumed to be potentially damaged.

>
>    Currently the syncing of DIT data is a per partition based operation
> unless a call is made to
>    DirectoryService's sync() method which internally calls the sync on
> PartitionNexus.
>
>    IMO this marking of journal should happen in the DirectoryService's
> sync() operation.

As the sync can occur over quite a long interval of time (let's say 15 secs
by default), if we depend on it to add a checkpoint, that means we may
lose 15 secs of modifications. It may be seen as acceptable, but
IMHO, the idea is to store logs as fast as possible, and when a
modification is considered done, then add a checkpoint. This will
limit the potential loss of information.

>
>    A change to the partition's sync() method to call DirectoryService's
> sync() which in turn calls (each) partition's
>    commit() (a new method) would help. A special flag like 'isDirty' in the
> partition will allow us to avoid calling
>    commit() on each partition.
>
>    Any other ideas about how best we can maintain a checkpoint/mark after
> *any* sync operation on DIT data?.

When the mod is considered done (even if not written into the
backend), we should consider the operation valid, then write it to
the log and, when done, add a checkpoint. That should be done for
every modification, whatever it costs.

>
>    Having said this, I have another issue, how do I detect the beginning of
> a corrupted
>    entry in a JDBM file(all the DIT data is stored in these files)
>
>    To put this in other way, if a JDBM file was synced at nth entry and
> server was crashed in the middle of
>    writing n+1th entry I would like to start recovery from the end of nth
> record (a common idea I believe though)
>    (haven't looked at jdbm code yet, but throwing this question anticipating
> a quick answer ;) )

Not an easy question. If you think about BDB, they are using a journal
for what they call a 'catastrophic' recovery. Basically what we need. If
we store information into a JDBM database, then writing into it might
corrupt the database. What saves us is that we can stop writing into
this database for a while, save a copy of it, and restart the
operations.

IMO, JDBM should be used to help access the modification operations,
but the master table should not contain the real data, just a
pointer to another file in which modifications are written in a
sequential way. We just keep an offset in the MasterTable. If
something goes really wrong, we can rebuild the master table and all
the indexes from this sequential file.
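
A rough illustration of this offset scheme (purely a sketch: the class is made up, and the BTree insert()/find()
calls are the standard jdbm API as far as I recall):

    import java.io.IOException;
    import java.io.RandomAccessFile;

    import jdbm.btree.BTree;

    // hypothetical sketch of the "offsets in the master table" idea
    public class SequentialStore
    {
        private final RandomAccessFile dataFile;   // append-only file holding the real entries
        private final BTree masterTable;           // JDBM b-tree mapping entry id -> offset

        public SequentialStore( RandomAccessFile dataFile, BTree masterTable )
        {
            this.dataFile = dataFile;
            this.masterTable = masterTable;
        }

        public void write( Long id, byte[] serializedEntry ) throws IOException
        {
            long offset = dataFile.length();
            dataFile.seek( offset );
            dataFile.writeInt( serializedEntry.length );   // length prefix, so the file can
            dataFile.write( serializedEntry );             // be re-scanned sequentially

            masterTable.insert( id, Long.valueOf( offset ), true );   // only the offset is indexed
        }

        public byte[] read( Long id ) throws IOException
        {
            Long offset = ( Long ) masterTable.find( id );
            dataFile.seek( offset.longValue() );
            byte[] entry = new byte[ dataFile.readInt() ];
            dataFile.readFully( entry );
            return entry;
        }
    }

If the master table or the indexes are ever lost, the length-prefixed data file can be scanned from the
beginning to rebuild them.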

This can be discussed, I'm just speaking my mind here, not imposing
any solution.

-- 
Regards,
Cordialement,
Emmanuel Lécharny
www.iktek.com

Re: [DRS] thoughts about implementation

Posted by Emmanuel Lecharny <el...@apache.org>.
Alex Karasulu wrote:
> Hmm I see what you're thinking.  I think we're all having problems drawing
> distinctions between these various facilities in the server.  I know I have
> wavered myself.
>
> At first I was thinking we should have an extremely simple journal with
> markers tracking the application of operations. Some conversations with Emmanuel
> then led me to believe that using the CL as the journal was just as good.
> Now I feel it might not be such a good idea.  Let me list my thoughts:
>
> (1) The CL is highly indexed with several db files, which means many writes are
> needed to persist a record while keeping the CL and its indices
> consistent.  Also, the CL is deep inside the server in the interceptor chain,
> and many things can go wrong before getting to that point, not to mention the
> processing time it takes to get there.
>
> (2) The CL should be used for auditing, versioning, snapshotting, and
> replication.  It is fast thanks to its indices and will show all the
> operations that have succeeded in inducing changes.  It would get more
> complicated if we started using it to also capture operations before they
> have been applied.  The semantics would shift.
>
> (3) As you say, the CL itself can get corrupted.  And for this reason it's
> not well suited as a journal for all (even in-progress) operations.
>
> I'm seriously thinking the use of the CL for a journal is not a good
> decision. The journal needs to be fast and simple, doing only one thing and
> doing it fast and flawlessly.
>   
Seems like we are now converging on the same vision :) The journal must
be as fast as possible, and dedicated to a very basic usage: DRS.

For replication, we need a slightly more complex data structure, thus 
the CL.

But we can just see the CL as a bunch of indexes built on top of the 
journal.

-- 
--
cordialement, regards,
Emmanuel Lécharny
www.iktek.com
directory.apache.org



Re: [DRS] thoughts about implementation

Posted by Kiran Ayyagari <ay...@gmail.com>.
> I agree that *ONLY* change operations that have succeeded should be 
> logged into the CL.  But I think we at least need a marker in the 
> journal to track the horizon between completed and in progress 
> operations no?
IMO, this is the ideal way a journal should work, but it requires transaction support.
HOWL has support for transactions, but I haven't evaluated yet whether to use it.
Another thing is that the new value of some attributes (like 'revisions') is only available after completing the
operation successfully and then storing it in the CL.

> 
> Alex
> 

-- 
Kiran Ayyagari

Re: [DRS] thoughts about implementation

Posted by Emmanuel Lecharny <el...@apache.org>.
Alex Karasulu wrote:
> I agree that *ONLY* change operations that have succeeded should be logged
> into the CL.  But I think we at least need a marker in the journal to track
> the horizon between completed and in progress operations no?
>   
Not necessarily. It can be stored in the CL. The journal is just a
safety net here: when we replay it, assuming we have started at the point
where we had a failure, then if an operation failed (like trying to delete
a non-existing entry), it will fail again.

With the CL, we can obviously go a bit further, and this is the goal,
but in case the CL is totally screwed up, we can restart from the journal.


-- 
--
cordialement, regards,
Emmanuel Lécharny
www.iktek.com
directory.apache.org



Re: [DRS] thoughts about implementation

Posted by Howard Chu <hy...@symas.com>.
David Boreham wrote:
> Hi, just a voice from the trenches, having done this a few times and
> made some mistakes:
>
> Do not invent a new kind of database for the change log. Instead use the
> one you already have
> for entries, with some appropriate indexing enhancements to support the
> change log semantics required.
> i.e. each change log record is in fact an entry. (there may be a way to
> do it where the change log
> data is actually stored in the entries themselves, and absolutely no
> additional records are
> required, but this seems to take an extra high level of cunning to pull
> off).

Agreed. Which is why we have
http://highlandsun.com/hyc/drafts/draft-chu-ldap-logschema-xx.html

> But, please don't make work for yourself by doubling the kinds of
> database you have.
> Two databases also makes it hard to ensure transactional integrity
> between the main
> db and the change log (which is important to have).

Yep.

-- 
   -- Howard Chu
   CTO, Symas Corp.           http://www.symas.com
   Director, Highland Sun     http://highlandsun.com/hyc/
   Chief Architect, OpenLDAP  http://www.openldap.org/project/

Re: [DRS] thoughts about implementation

Posted by Emmanuel Lecharny <el...@apache.org>.
David Boreham wrote:
> Hi, just a voice from the trenches, having done this a few times and 
> made some mistakes:
>
> Do not invent a new kind of database for the change log. Instead use 
> the one you already have
> for entries, with some appropriate indexing enhancements to support 
> the change log semantics required.
> i.e. each change log record is in fact an entry. (there may be a way 
> to do it where the change log
> data is actually stored in the entries themselves, and absolutely no 
> additional records are
> required, but this seems to take an extra high level of cunning to 
> pull off).
That's an interesting idea. Why should we store twice what we already
have in the backend, if the modification has successfully been applied?
Now, how we deal with deletes and modifications is another story.

We are thinking about storing the LDAP request in the CL, not the
entire entry, and we also want to store the controls. At some point, to
avoid useless encoding/decoding, we also thought about storing the PDU only.
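
Just to make that concrete, a CL record could be as simple as something like this (a purely hypothetical
shape, nothing that exists today):

    // hypothetical changelog record: the raw request PDU plus a little metadata
    public class ChangeLogRecord implements java.io.Serializable
    {
        private final long revision;        // assigned once the operation has succeeded
        private final String principalDn;   // who performed the change
        private final byte[] requestPdu;    // the LDAP request as received, controls and
                                            // all, so nothing has to be re-encoded later

        public ChangeLogRecord( long revision, String principalDn, byte[] requestPdu )
        {
            this.revision = revision;
            this.principalDn = principalDn;
            this.requestPdu = requestPdu;
        }

        public long getRevision() { return revision; }
        public String getPrincipalDn() { return principalDn; }
        public byte[] getRequestPdu() { return requestPdu; }
    }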

If you mean that we should store all the modifications in the backend,
as if they were standard LDAP entries, sure, we can. We just have to
bypass much of the internal logic (like schema checking and such). We
are not that far into the implementation right now :)
>
> But, please don't make work for yourself by doubling the kinds of 
> database you have.
> Two databases also makes it hard to ensure transactional integrity 
> between the main
> db and the change log (which is important to have).
We don't have transaction integrity anyway :) But yes, we need to
guarantee that an operation that failed is marked as such in the CL, and
that a successful operation is also tagged as such in the CL. Everything
in between is just bad.


-- 
--
cordialement, regards,
Emmanuel Lécharny
www.iktek.com
directory.apache.org



Re: [DRS] thoughts about implementation

Posted by David Boreham <da...@bozemanpass.com>.
Hi, just a voice from the trenches, having done this a few times and 
made some mistakes:

Do not invent a new kind of database for the change log. Instead use the 
one you already have
for entries, with some appropriate indexing enhancements to support the 
change log semantics required.
i.e. each change log record is in fact an entry. (there may be a way to 
do it where the change log
data is actually stored in the entries themselves, and absolutely no 
additional records are
required, but this seems to take an extra high level of cunning to pull 
off).

But, please don't make work for yourself by doubling the kinds of 
database you have.
Two databases also makes it hard to ensure transactional integrity 
between the main
db and the change log (which is important to have).



Re: [DRS] thoughts about implementation

Posted by Alex Karasulu <ak...@gmail.com>.
On Tue, Feb 3, 2009 at 2:19 PM, Kiran Ayyagari <ay...@gmail.com> wrote:

>
> Hi Alex,
>
>  I'm seriously thinking the use of the CL for a journal is not a good
>> decision. The journal needs to be fast and simple, doing only one thing and
>> doing it fast and flawlessly.
>>
> +1, by 'replica of CL' I mean the journal contains the same data that the CL
> stores in its store, minus the indices, and the journal just writes that data in
> sequential order (the CL could be a B-Tree).


Right, I agree the CL is fine as it is with the B-Tree and indices.  The
journal need not be a B-Tree if all we need is sequential access.  We merely
need a simple pointer or two into it to know where the last operation was
successfully processed.  Hence everything after it in the file
represents currently active operations.


>
> However, an assumption I made here is that each valid operation succeeds at
> least before storing its data in the CL
> (we have to write the same data to the journal before storing it in the CL); this way
> we no longer need any marking operation, because
> if the master db gets corrupted we use the CL to restore it, and if the CL also
> gets corrupted then we can restore the CL from the
> journal (starting from the beginning, hence no intermediate 'marks') and then
> restore the master db.
>

I agree that *ONLY* change operations that have succeeded should be logged
into the CL.  But I think we at least need a marker in the journal to track
the horizon between completed and in-progress operations, no?

Alex

Re: [DRS] thoughts about implementation

Posted by Kiran Ayyagari <ay...@gmail.com>.
Hi Alex,

> I'm seriously thinking the use of the CL for a journal is not a good 
> decision. The journal needs to be fast and simple, doing only one thing 
> and doing it fast and flawlessly.
+1, by 'replica of CL' I mean the journal contains the same data that the CL stores in its store, minus the indices, and
the journal just writes that data in sequential order (the CL could be a B-Tree).
However, an assumption I made here is that each valid operation succeeds at least before storing its data in the CL
(we have to write the same data to the journal before storing it in the CL); this way we no longer need any marking
operation, because if the master db gets corrupted we use the CL to restore it, and if the CL also gets corrupted then we
can restore the CL from the journal (starting from the beginning, hence no intermediate 'marks') and then restore the master db.


Kiran Ayyagari

> 
> Alex

Re: [DRS] thoughts about implementation

Posted by Alex Karasulu <ak...@gmail.com>.
Hi Kiran,

On Mon, Feb 2, 2009 at 12:28 PM, Kiran Ayyagari <ay...@gmail.com> wrote:

> hi Alex,
>
>
>> Again like I said it's not this simple. I think JDBM API's start to fail
>> overall on corruption depending on how the corruption impacts accessing the
>> BTree. One bad access can cause access to half the entries to fail.
>>
> yeah, JDBM fails even if a single byte in the end changes (I just removed a
> character using vi editor then started the server, it barfed ;) )
>
>>
>> I think your idea would work very well if the journal were well
>> integrated with JDBM at the lowest level.
>
> don't think I understood the 'lowest level' completely
>

For example, you need to know when a page goes bad and repair the page.  You
need to be involved with the structures deep in the library to detect
problems in them.  For example, you added a byte at the end of a db file and
everything got screwed up. Recovering from something like this would require
adding some code to the RecordManager, BlockIO and PageHeader classes.  You
basically need to integrate your journal code into the JDBM library.


>
> and one more thing, I think this journal should be a replica of CL (CL is
> more robust with indices though) this way we can even recover a crashed CL
>

Hmm I see what you're thinking.  I think we're all having problems drawing
distinctions between these various facilities in the server.  I know I have
wavered myself.

At first I was thinking we should have an extremely simple journal with
markers tracking the application of operations. Some conversations with Emmanuel
then led me to believe that using the CL as the journal was just as good.
Now I feel it might not be such a good idea.  Let me list my thoughts:

(1) The CL is highly indexed with several db files, which means many writes are
needed to persist a record while keeping the CL and its indices
consistent.  Also, the CL is deep inside the server in the interceptor chain,
and many things can go wrong before getting to that point, not to mention the
processing time it takes to get there.

(2) The CL should be used for auditing, versioning, snapshotting, and
replication.  It is fast thanks to its indices and will show all the
operations that have succeeded in inducing changes.  It would get more
complicated if we started using it to also capture operations before they
have been applied.  The semantics would shift.

(3) As you say, the CL itself can get corrupted.  And for this reason it's
not well suited as a journal for all (even in-progress) operations.

I'm seriously thinking the use of the CL for a journal is not a good
decision. The journal needs to be fast and simple, doing only one thing and
doing it fast and flawlessly.

Alex

Re: [DRS] thoughts about implementation

Posted by Kiran Ayyagari <ay...@gmail.com>.
hi Alex,


> Again like I said it's not this simple. I think JDBM API's start to fail
> overall on corruption depending on how the corruption impacts accessing the
> BTree. One bad access can cause access to half the entries to fail.
>
yeah, JDBM fails even if a single byte at the end changes (I just removed a
character using the vi editor, then started the server, and it barfed ;) )

>
> I think your idea would work very well if the journal were well integrated
> with JDBM at the lowest level.

I don't think I understood the 'lowest level' part completely

And one more thing: I think this journal should be a replica of the CL (though the CL is
more robust, with its indices); this way we can even recover a crashed CL.

wdyt?

Kiran Ayyagari

>
>
> Regards,
> Alex
>
>

Re: [DRS] thoughts about implementation

Posted by Alex Karasulu <ak...@gmail.com>.
Hi Kiran,

On Sun, Feb 1, 2009 at 1:49 PM, Kiran Ayyagari <ay...@gmail.com> wrote:

> Hello guys,
>
>    Here is an initial idea about implementation which I have in my mind
>
>    HOWL has a feature called 'marking' in its log file (a.k.a journal). The
> idea is to use this as a checkpoint since
>    the last successful disk write of the DIT data i.e whenever we perform a
> sync we put a mark in the journal.
>    in case of a crash we can retrieve the data from journal from the marked
> position(using howl API),
>
>    Currently the syncing of DIT data is a per partition based operation
> unless a call is made to
>    DirectoryService's sync() method which internally calls the sync on
> PartitionNexus.
>
>    IMO this marking of journal should happen in the DirectoryService's
> sync() operation.
>
>    A change to the partition's sync() method to call DirectoryService's
> sync() which in turn calls (each) partition's
>    commit() (a new method) would help. A special flag like 'isDirty' in the
> partition will allow us to avoid calling
>    commit() on each partition.
>
>    Any other ideas about how best we can maintain a checkpoint/mark after
> *any* sync operation on DIT data?.
>
>    Having said this, I have another issue, how do I detect the beginning of
> a corrupted
>    entry in a JDBM file(all the DIT data is stored in these files)
>

The problem with JDBM file corruption is that you lose everything. I don't
think the .db file is recoverable; it needs to be rebuilt.  From my
impressions of user issues due to corruption and from past experience, when the
file is corrupt the whole file is lost.  It's not a single record in the db
file that is bad, so the entire file needs to be reconstructed.

If the file is an index, this is recoverable.  If it's the master.db, then we
have a serious disaster.  In this case the entire changelog must be used to
rebuild the master.


>
>    To put this in other way, if a JDBM file was synced at nth entry and
> server was crashed in the middle of
>    writing n+1th entry I would like to start recovery from the end of nth
> record (a common idea I believe though)
>    (haven't looked at jdbm code yet, but throwing this question
> anticipating a quick answer ;) )
>

Again, like I said, it's not this simple. I think the JDBM APIs start to fail
overall on corruption, depending on how the corruption impacts access to the
BTree. One bad access can cause access to half the entries to fail.

I think your idea would work very well if the journal were well integrated
with JDBM at the lowest level.

Regards,
Alex