You are viewing a plain text version of this content. The canonical link for it is here.

Posted to users@directory.apache.org by Ca...@ibs-ag.com on 2015/09/23 21:41:47 UTC

ApacheDS with Mavibot anytime soon?

Hi,

We have one customer that wants to get replication implemented. We've set it up in house and it seems to work well.
The trouble is there are 10's of thousands of users and all sorts of concurrent activity. They're on M16 and a few times per month, we get the 'ERR_554 double get for block' and then we need to restore.
We have an automated way of doing this now so it's not too disruptive when it happens.

My question is, would adding replication just add more concurrent activity to the servers and lead more frequently to the ERR_554 situation?

Also he's asked to upgrade to M20 but my hunch is that for this issue it won't matter much because we're still on JDBM?

Finally, we're eager to test out a release with the Mavibot backed. I'm sure it's possible to build ourselves but we're trying to keep the moving parts to a minimum
and use only ApacheDS releases.

Thanks for all your excellent work
Carlo

Re: ApacheDS with Mavibot anytime soon?

Posted by Emmanuel Lécharny <el...@gmail.com>.

Le 25/03/16 17:21, Ashma Shrestha a écrit :
> Carlos,
>
> Thanks for the help.
>
> @Emmanuel - I know you have a busy schedule however, is there a timeline to
> when we can expect a fix? Currently we are working a system which might
> make this scenario of concurrent update and search occur more.
No, saldy not, I don't have any timeframe.

My dayjob is totally killing me atm, and it's the very same for Kiran.
We would LOVE finding people capable of giving some hands... The other
option would be for my company to let me spent one month on that, but
they are not philantropists ;-)

Re: ApacheDS with Mavibot anytime soon?

Posted by Ashma Shrestha <as...@gmail.com>.

Carlos,

Thanks for the help.

@Emmanuel - I know you have a busy schedule however, is there a timeline to
when we can expect a fix? Currently we are working a system which might
make this scenario of concurrent update and search occur more.

Ashma

On Thu, Mar 24, 2016 at 6:02 PM, Accorsi, Carlo <Ca...@siemens.com>
wrote:

> HI Ashma,
> We have all our directory content dumped out as an LDIF on a periodic
> schedule. If the system crashes, we delete the partition folders on the
> file system, restart the server and  re-inject the entries from the ldif.
>
> If your directory is small < 1000 entries, you can just get them perhaps
> all in one search result.
> Otherwise use a paged result to loop through each entry and call:
>
> String ldif = LdifUtils.convertToLdif(entry)
>
> and write / append the string value to a file. This is your backup ldif
> file.
>
> Hope this helps..
>
>
>
>
> -----Original Message-----
> From: Ashma Shrestha [mailto:ashma.shrestha@gmail.com]
> Sent: Thursday, March 24, 2016 4:38 PM
> To: users@directory.apache.org
> Subject: Re: ApacheDS with Mavibot anytime soon?
>
> Hi Carlos,
>
> You have mentioned "we get the 'ERR_554 double get for block' and then we
> need to restore. We have an automated way of doing this now so it's not too
> disruptive when it happens." . I am running into the same ERR_554 issue,
> can you let me know how you fixed it?
>
> Thanks,
>
> Ashma
>
> On Mon, Sep 28, 2015 at 9:15 AM, <Ca...@ibs-ag.com> wrote:
>
> > Emmanuel! Wow!! Thank you for that most detailed and comprehensive
> > write up of the concurrency problem you are working to solve with
> > Mavibot.  The design sounds world class.
> > I now fully appreciate the complexity of this transaction issue and
> > the ASCII art helped make it clear. We realize you're all doing your
> > best to devise a robust solution for this and so we'll sit tight for now.
> > I appreciate the time you took to explain this to us. Have fun at the
> > conference.
> > Best, Carlo Accorsi
> >
> >
> >
> >
> > -----Original Message-----
> > From: Emmanuel Lécharny [mailto:elecharny@gmail.com]
> > Sent: Sunday, September 27, 2015 1:23 PM
> > To: users@directory.apache.org
> > Subject: Re: ApacheDS with Mavibot anytime soon?
> >
> > Le 23/09/15 21:41, Carlo.Accorsi@ibs-ag.com a écrit :
> > > Hi,
> >
> > hi,
> >
> > sorry for the delayed response ! I saw your mail 3 days ago, but I
> > forgot to answer back then, being caught in a storm of delaying events...
> > >
> > > We have one customer that wants to get replication implemented.
> > > We've
> > set it up in house and it seems to work well.
> > > The trouble is there are 10's of thousands of users and all sorts of
> > concurrent activity. They're on M16 and a few times per month, we get
> > the
> > 'ERR_554 double get for block' and then we need to restore.
> > > We have an automated way of doing this now so it's not too
> > > disruptive
> > when it happens.
> > Hopefully. Still I understand this is not convenient.
> > >
> > > My question is, would adding replication just add more concurrent
> > activity to the servers and lead more frequently to the ERR_554
> situation?
> > For the consumer, it will be just as if the updates were coming from a
> > standard user. So, no, I don't think it's a real problem.
> > >
> > > Also he's asked to upgrade to M20 but my hunch is that for this
> > > issue it
> > won't matter much because we're still on JDBM?
> >
> > True. It's just that some bugs have been fixed in M20.
> > >
> > > Finally, we're eager to test out a release with the Mavibot backed.
> > > I'm sure it's possible to build ourselves but we're trying to keep
> > > the
> > moving parts to a minimum and use only ApacheDS releases.
> > I understand.
> >
> > Mavibot work has been delayed, and we are working on it as much as we
> > can, but it's not enough. Kiran and I have day jobs atm that forbid us
> > to dedicate as much time as would be necessary to complete the last
> > part that is mandatory to get this problem fixed : the transaction
> > system. We restarted to 'think' about the best implementation 2 weeks
> > ago (on your free time) and we expect to be able to spend more time on
> > it next week, as we will both be at Budapest, during the Apache
> Conference.
> >
> > Transaction  support (the way we need it, ie cross-btree) is not that
> > simple to impelment. We have a pretty clear idea on how it should
> > work, but that mpacts the current design, as we need to woirk in
> > memory, and avoid to copy pages that have already been modifed in the
> > transaction boundaries. We also have to deal with the cleanup of old
> > versions, which also means we need to implement the version manager.
> >
> > To give you a insight on what I'm coming for, here is a draft of my
> > thoughts about this transaction manager (note that this is my vision,
> > that I haven't shared yet, because I don't think it's covering all the
> > moving parts, but still, I think it won't be that different to what we
> > will end with ) :
> >
> > Transaction support
> > -------------------
> >
> > Mavibot must support transaction across many B-tres (ie, whatever the
> > number of b-trees we are updating, we should wait before committing
> > the changes up to the point the transaction is closed). That means we
> > may have many updates pending in memory. In any case, if a rollback is
> > done, the modified b-trees will remain intact, in the same state they
> > were before the transaction started.
> >
> > To achive this, we need to work in a working memory. As we may have to
> > fetch pages that are to be updated, we will face three cases :
> > - the page is not in memory :
> > we have to fetch them from disk, update them and store it in the
> > working memory we can read it back if needed
> > - the page is in the cache :
> > we have to copy it to the working memory
> > - the page is in the working memory : we simply update it, leaving it
> > in the working memory.
> >
> > So if we have to modify the same page many times, it will be updated
> > in the working memory, we won't need to copy it.
> >
> > That actually change the current Mavibot implementation, as it's
> > copying pages no matter what. Here, if the page is coming from the
> > working memory, we just update it.
> >
> >
> > Layout
> > ------
> >
> >  Disk            Cache           Working
> >   ___                            Memory
> >  /   \          .------.
> > +     +        /      /|         .------.
> > |\___/|       .------. |       .------. |
> > +     +       |      | |       |      | |
> > |\___/| ----> |      | | ----> |      | |
> > +     +       |      |/        |      |/
> > |\___/|       .------.         .------.
> > +     +
> >  \___/
> >
> >  When the transaction is rollbacked, we simply delete the content of
> > the working memory.
> >  When the transaction is committed, we write all the pages in the
> > working memory on disk, and into the cache, we update the header, and
> > we purge the working memory.
> >
> >  If a crash occurs during a transaction, then either the transaction
> > is rollbacked (if the process is still running) - which is just about
> > purging the working memory -, or there will be nothing to do if we
> > have to restart the application, as the modified pages were in memory
> > nad have been removed while the process crashed.
> >
> > OTOH, when we read, we must be sure that the B-trees are present, and
> > that we always use the B-tree revisions that are correlated. This is
> > critical when using LDAP, where many B-trees are updated during a sinlge
> write.
> >
> > The way to do that is to use a revision, and always use the B-trees
> > which revision is just below or equal to this revision. Let's see with
> > an example
> >
> >  +-----+---+---+---+---+---+---+---+...
> >  |\ Rev|   |   |   |   |   |   |   |
> >  | \   |   |   |   |   |   |   |   |
> >  |  \  | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8
> >  |B- \ |   |   |   |   |   |   |   |
> >  |tree\|   |   |   |   |   |   | L | N
> >  +-----+---+---+---+---+---+---+---+...
> >  | aaa | X |   |   | X | X | X |   |
> >  |     | 1 | 1 | 1 | 4 | 5 | 6 | 6 |
> >  +-----+---+---+---+---+---+---+---+...
> >  | bbb | X | X | X |   | X | X | X |
> >  |     | 1 | 2 | 3 | 3 | 5 | 6 | 7 |
> >  +-----+---+---+---+---+---+---+---+...
> >  | ccc | X |   | X | X | X |   | X |
> >  |     | 1 | 1 | 3 | 4 | 5 | 5 | 7 |
> >  +-----+---+---+---+---+---+---+---+...
> >  | ddd | X | X |   |   | X |   |   |
> >  |     | 1 | 2 | 2 | 2 | 5 | 5 | 5 |
> >  +-----+---+---+---+---+---+---+---+...
> >  | eee | X |   | X | X |   | X |   |
> >  |     | 1 | 1 | 3 | 4 | 4 | 6 | 6 |
> >  +-----+---+---+---+---+---+---+---+...
> >
> > L : Latest revision
> > N : New revision
> >
> > In this table, we have had 7 updates, and 7 revsions. As all the
> > B-trees haven't been updated, we may have some 'holes' in the revision
> > associated to a B-tree. Typically, here, the 'aaa' B-tree will have 4
> > revisions (1, 4,
> > 5 and 6) created, because this B-tree has been updated
> > 4 times.
> >
> > Doing a read on revision 7 will use the following B-trees : aaa[6],
> > bbb[7], ccc[7], ddd[5], eee[6].
> >
> >
> > All in all, a read is associated to a global transaction revision
> > which is used as a marker. Every B-tree update is done using the new
> revision.
> >
> > Managing on-going reads
> > -----------------------
> >
> > We maintain all the B-tree revisions associated with a read, until the
> > read is completed, we need to get rid of the old revision when we are
> done.
> > This is critical to keep the database size to a minimum. In order to
> > do that, we must know if an older revision is still in use. We then
> > need to keep a common data structure that is shared across threads,
> > which will hold all the used revisions, ordered. With the previous
> > example, assuming revisions 2, 4 and 6 are in use, we may use a linked
> list for that purpose :
> >
> >    +---+   +---+   +---+
> > o--| 2 |<--| 4 |<--| 6 |
> >    +--2+   +--1+   +--3+
> >
> > When the read using rev 4 is done, we remove it from this list :
> >           +-------------+
> >    +---+  |    +---+    | +---+
> > o--| 2 |<-+  X-| 4 |    +-| 6 |
> >    +--2+       +--0+      +--3+
> >
> > The two remaining revisions are still in use, and one of them is older.
> > We can clean the B-tree pages associated with revision 4 if there is
> > already a new revision for this B-tree, and for the pages that have
> > been copied.
> >
> > When the read using rev 2 is done, we can clean all the B-tree pages
> > that are associated with every revision below 6, as soon as there is a
> > B-tree revision above the oldest one.
> >
> > An other important point : each revision might be used more than once.
> > Releasing a read will decrement the count, and when it goes down to 0,
> > we can safely remove the element from the list. We need a thread-safe
> > data structure for that.
> >
> > Managing the list
> > -----------------
> >
> > let's go a bit deeper in the management of this list. It has
> > interesting characteristics :
> >
> > - elements are added by a unique thread : the writter
> > - elements are not removed unless the count becomes 0
> > - the count is equal to the maximum number of thread that can use the
> > element at some point
> > - it's not because the counter is N that N thread *are* using this
> > element
> > : we may eventually have no thread using it ! (but they *may* come
> > back
> > later...)
> > - each element must be be protected agains concurrent update (which is
> > a different requirement than protecting the full list)
> > - any thread requiring an element in the list will always use the head
> > of the least (which is the most recent)
> > - we don't do any list traversal ! Once a thread uses an element, it
> > remembers about its position
> > - no user Thread is allowed to remove an element from this list. The
> > only thread that can do that is the Cleaner Thread
> >
> >
> > The last characteristic is interesting : that means we can safely
> > depends on one single thread to do the cleaning, so there is no
> > concurrent access on teh list while deleting elements.
> >
> > The deletion can be done in two ways :
> > - from the tail of the list, if the element's counter is equal to zero
> > (and we can iterate)
> > - from the inner of the list, when the element's counter to be removed
> > is zero
> >
> > The first way is simpler, but it holds potentially a lot of element if
> > the oldest one is there for a long time The second way implies we can
> > relink elements in the list, ie this is a double-linked list
> >
> > Let's assume we go for the first way. In any case, the list will
> > *always* have at least one element, even if its counter is zero :
> >
> >    head
> >    +---+
> >    | N | revision
> >    +---+
> >    | 0 | users counter (AtomicInteger)
> >    +---+
> >    |   |--o
> >    +---+
> > o--|   |
> >    +---+
> >
> > A thread wanting to use revision N will simply increment the counter,
> > which is an AtomicInteger, guaranteing that we can't have a wrong
> > update of this value. The revision will never change.
> > At this point, using the head is quite straight-forward.
> >
> > * Adding an element
> >
> > So the writter creates a new revision. This will create a new element :
> >
> >     new      head
> >    +---+     +---+
> >    |N+1|     | N |<<--- T1, T2, T5
> >    +---+     +---+
> >    | 0 |     | 3 |
> >    +---+     +---+
> >    |   |--o  |   |--o
> >    +---+     +---+
> > o--|   |  o--|   |
> >    +---+     +---+
> >
> > The Head should always be pointing to one single cell, and any thread
> > should be able to get it. This is easy to do if we use an
> AtomicReference :
> >
> >     new      [head]
> >    +---+     +---+
> >    |N+1|     | N |<<--- T1, T2, T5
> >    +---+     +---+
> >    | 0 |     | 3 |
> >    +---+     +---+
> >    |   |--o  |   |--o
> >    +---+     +---+
> > o--|   |  o--|   |
> >    +---+     +---+
> >
> > Swapping the head is done after we have updated the links, which is
> > done by the writter thread :
> >
> > Step 1 :
> >
> > Attach the head to the new element
> >
> >     new      [head]
> >    +---+     +---+
> >    |N+1|     | N |<<--- T1, T2, T5
> >    +---+     +---+
> >    | 0 |     | 3 |
> >    +---+     +---+
> >    |   |--o  |   |--o
> >    +---+     +---+
> > o--|   |<----|   |
> >    +---+     +---+
> >
> > Step 2 :
> >
> > Attach the new element to the head (Here, at the same time, a new
> > thread needs the latest revison (still N))
> >
> >
> >     new      [head]
> >    +---+     +---+
> >    |N+1|     | N |<<--- T1, T2, T5, T7
> >    +---+     +---+
> >    | 0 |     | 4 |
> >    +---+     +---+
> >    |   |---->|   |--o
> >    +---+     +---+
> > o--|   |<----|   |
> >    +---+     +---+
> >
> > Step 3 :
> >
> > Swap the head
> >
> >    [head]
> >    +---+     +---+
> >    |N+1|     | N |<<--- T1, T2, T5, T7
> >    +---+     +---+
> >    | 0 |     | 4 |
> >    +---+     +---+
> >    |   |---->|   |--o
> >    +---+     +---+
> > o--|   |<----|   |
> >    +---+     +---+
> >
> > Step 4 :
> >
> > A new thread need the latest revision
> >
> >
> >            +--- T8
> >    [head]  |
> >    +---+   | +---+
> >    |N+1|<<-+ | N |<<--- T1, T2, T5, T7
> >    +---+     +---+
> >    | 1 |     | 4 |
> >    +---+     +---+
> >    |   |---->|   |--o
> >    +---+     +---+
> > o--|   |<----|   |
> >    +---+     +---+
> >
> > At this point, no other thread will *ever* be attached to revision N
> >
> > This can go on an on, with new revisions attached to the list by the
> > writter thread. Now, let's consider two different use case :
> > 1) when the oldest revision has a counter going down to zero
> > 2) when a revision which is not the head nor the oldest that is going
> > to zero
> >
> > First case, a first approach :
> >
> > This is the simplest. We have to remove the element from the lists,
> > and check if its parent's counter is also zero and is not the head. If
> > so, we iterate until we meet an element which counter is not zero and
> > is not the head.
> >
> > In any case, we simply have to change its parent's next pointer, and
> > to remove the element from the list :
> >
> >            +--- T8   +-- T1, T2, T5, T7
> >    [head]  |         |
> >    +---+   | +---+   | +---+     +---+
> >    |N+1|<<-+ | N |<<-+ |N-1|     |N-2|x---- T9 This thread will release
> > the element, setting teh counter to 0
> >    +---+     +---+     +---+     +---+
> >    | 1 |     | 4 |     | 0 |     | 0 |  Was 1
> >    +---+     +---+     +---+     +---+
> >    |   |---->|   |---->|   |---->|   |--o
> >    +---+     +---+     +---+     +---+
> > o--|   |<----|   |<----|   |<----|   |
> >    +---+     +---+     +---+     +---+
> >
> > The thread that set the counter to 0
> >
> > Step 1 :
> >
> > Detach the last element parent's next pointer
> >
> >            +--- T8   +-- T1, T2, T5, T7
> >    [head]  |         |
> >    +---+   | +---+   | +---+     +---+
> >    |N+1|<<-+ | N |<<-+ |N-1|     |N-2|x---- T9
> >    +---+     +---+     +---+     +---+
> >    | 1 |     | 4 |     | 0 |     | 0 |
> >    +---+     +---+     +---+     +---+
> >    |   |---->|   |---->|   |--o  |   |--o
> >    +---+     +---+     +---+     +---+
> > o--|   |<----|   |<----|   |<----|   |
> >    +---+     +---+     +---+     +---+
> >
> > Step 2 :
> > Iterate if the parent's counter is 0
> >
> >            +--- T8   +-- T1, T2, T5, T7
> >    [head]  |         |
> >    +---+   | +---+   | +---+     +---+
> >    |N+1|<<-+ | N |<<-+ |N-1|     |N-2|x---- T9
> >    +---+     +---+     +---+     +---+
> >    | 1 |     | 4 |     | 0 |     | 0 |
> >    +---+     +---+     +---+     +---+
> >    |   |---->|   |--o  |   |--o  |   |--o
> >    +---+     +---+     +---+     +---+
> > o--|   |<----|   |<----|   |<----|   |
> >    +---+     +---+     +---+     +---+
> >
> > Step 3 :
> > Detach the last element which counter is 0 prev pointer
> >
> >            +--- T8   +-- T1, T2, T5, T7
> >    [head]  |         |
> >    +---+   | +---+   |     +---+     +---+
> >    |N+1|<<-+ | N |<<-+     |N-1|     |N-2|x---- T9
> >    +---+     +---+         +---+     +---+
> >    | 1 |     | 4 |         | 0 |     | 0 |
> >    +---+     +---+         +---+     +---+
> >    |   |---->|   |--o      |   |--o  |   |--o
> >    +---+     +---+         +---+     +---+
> > o--|   |<----|   |      o--|   |<----|   |
> >    +---+     +---+         +---+     +---+
> >
> > Step 4 :
> >
> >  We can now dispose the unlinked elements.
> >
> > What if a thread is freeing an element while we are already free
> > another one ? This can be exposed using the previous sample :
> >
> >            +--- T8   +-- T1, T2, T5, T7
> >    [head]  |         |
> >    +---+   | +---+   | +---+     +---+
> >    |N+1|<<-+ | N |<<-+ |N-1|     |N-2|x---- T9 This thread will release
> > the element, setting teh counter to 0
> >    +---+     +---+     +---+     +---+
> >    | 1 |     | 4 |     | 1 |     | 0 |  Was 1
> >    +---+     +---+     +---+     +---+
> >    |   |---->|   |---->|   |---->|   |--o
> >    +---+     +---+     +---+     +---+
> > o--|   |<----|   |<----|   |<----|   |
> >    +---+     +---+     +---+     +---+
> >
> > Step 0+
> >
> > A thread (10) release the N-1 revison. We have two possibilities here :
> >
> > 1) The N-2 parent's next is not yet detached
> >
> >
> >            +--- T8   +-- T1, T2, T5, T7
> >    [head]  |         |
> >    +---+   | +---+   | +---+     +---+
> >    |N+1|<<-+ | N |<<-+ |N-1|     |N-2|x---- T9 This thread will release
> > the element, setting teh counter to 0
> >    +---+     +---+     +---+     +---+
> >    | 1 |     | 4 |     | 0 |     | 0 |  Was 1
> >    +---+     +---+     +---+     +---+
> >    |   |---->|   |---->|   |---->|   |--o
> >    +---+     +---+     +---+     +---+
> > o--|   |<----|   |<----|   |<----|   |
> >    +---+     +---+     +---+     +---+
> >
> > We just continue walking the list up to the first element with a non
> > zero counter
> >
> > 2) The N-2 paren'ts next is already detached
> >
> >
> >
> >            +--- T8   +-- T1, T2, T5, T7
> >    [head]  |         |         +-- T10
> >    +---+   | +---+   | +---+   | +---+
> >    |N+1|<<-+ | N |<<-+ |N-1|x--+ |N-2|x---- T9 This thread will
> > release the element, setting teh counter to 0
> >    +---+     +---+     +---+     +---+
> >    | 1 |     | 4 |     | 0 |     | 0 |  Was 1
> >    +---+     +---+     +---+     +---+
> >    |   |---->|   |---->|   |--o  |   |--o
> >    +---+     +---+     +---+     +---+
> > o--|   |<----|   |<----|   |<----|   |
> >    +---+     +---+     +---+     +---+
> >
> > In this case, we will have 2 threads dealing with two elements to be
> > cleaned up. This is not a problem, as T9 will now detach the N-2
> > element from teh N-1 one and T9 will detach the N-1 element from the
> > list. All is safe.
> >
> > About the removed elements
> > --------------------------
> >
> > Once a thread has detached some elements which counters are equal to
> > 0, what should we do with them ? As we will update the disk with the
> > removed pages, we have to send them to teh Writter thread. This can be
> > done by pushing teminto a deque (and more specifically a
> > ConcurrentLinkedDeque)
> >
> > Another option is that the writer thread handle the cleaning of this
> > queue when it is weaken up or periodically. In this case, we don't
> > need to push the element sinto a deque, and the reader threads will
> > just set the counter to 0.
> >
> > Second case
> > -----------
> >
> > What if we accept the removal of elements from the middle of the list ?
> > This is way more complex, as we have to be sure we don't break the chain.
> > One way to get rid of this constraint is to enforce the writer thread
> > to do the job, by periodically checing the whole list (which shuld not
> > be too costly). In this case, the reader thread never update the
> > element's parent's next pointer, they just decrement teh counter.
> > When this counter becomes 0, then we can remove the element.
> >
> > Solution
> > ========
> >
> > We will separate the threads into 2 categories :
> > - readers (we may have many)
> > - writer (we have only one)
> >
> > The readers always peek the latest revision (which is the head of the
> > list). It then increment the counter. When teh thread does not need
> > the revision anymore, the counter is decremented. This is the only
> > action the reader thread will ever do on the elements and on the list,
> > but it will always keep a reference to the element into the session.
> >
> > The writer thread has a lot more to do :
> > - it is responsible for adding element at the top of the list (see
> upper).
> > Once the head is swapped, every new reader will get the added element
> > as a reference.
> > - it also has to check in the list for the elements which counter is
> > 0, if teh list is longer than 1 element. This check can be done after
> > each update, or periodically (every N seconds if the number of
> > elements is > 1, or after M additions) if tht help saving updates on
> disk.
> >
> > Referencing pages
> > -----------------
> >
> > All the B-tree revisions are kept into the BoB B-tree, so it's easy to
> > retrieve the rootPage for a given revision.
> >
> > As all the pages can't be hold in memory, we sometime have to reach a
> > page from disk or from the cache. This is done following this simple
> algorithm :
> >
> >   for a given page with offset O
> >     if the cache contains an instance for O
> >       return the page
> >     else
> >       fetch the page with offset O from disk
> >       store it in the cache
> >
> > In our case, we just add a layer, which will check in the working
> > memory for the instance of a page, given its offset.
> >
> >
> >
>
>
> --
> Ashma Shrestha
> PHP Developer
> (770) 310-9456
> ashma.shrestha@gmail.com
>



-- 
Ashma Shrestha
PHP Developer
(770) 310-9456
ashma.shrestha@gmail.com

RE: ApacheDS with Mavibot anytime soon?

Posted by "Accorsi, Carlo" <Ca...@siemens.com>.

HI Ashma, 
We have all our directory content dumped out as an LDIF on a periodic schedule. If the system crashes, we delete the partition folders on the file system, restart the server and  re-inject the entries from the ldif. 

If your directory is small < 1000 entries, you can just get them perhaps all in one search result. 
Otherwise use a paged result to loop through each entry and call:

String ldif = LdifUtils.convertToLdif(entry)

and write / append the string value to a file. This is your backup ldif file. 

Hope this helps.. 




-----Original Message-----
From: Ashma Shrestha [mailto:ashma.shrestha@gmail.com] 
Sent: Thursday, March 24, 2016 4:38 PM
To: users@directory.apache.org
Subject: Re: ApacheDS with Mavibot anytime soon?

Hi Carlos,

You have mentioned "we get the 'ERR_554 double get for block' and then we need to restore. We have an automated way of doing this now so it's not too disruptive when it happens." . I am running into the same ERR_554 issue, can you let me know how you fixed it?

Thanks,

Ashma

On Mon, Sep 28, 2015 at 9:15 AM, <Ca...@ibs-ag.com> wrote:

> Emmanuel! Wow!! Thank you for that most detailed and comprehensive 
> write up of the concurrency problem you are working to solve with 
> Mavibot.  The design sounds world class.
> I now fully appreciate the complexity of this transaction issue and 
> the ASCII art helped make it clear. We realize you're all doing your 
> best to devise a robust solution for this and so we'll sit tight for now.
> I appreciate the time you took to explain this to us. Have fun at the 
> conference.
> Best, Carlo Accorsi
>
>
>
>
> -----Original Message-----
> From: Emmanuel Lécharny [mailto:elecharny@gmail.com]
> Sent: Sunday, September 27, 2015 1:23 PM
> To: users@directory.apache.org
> Subject: Re: ApacheDS with Mavibot anytime soon?
>
> Le 23/09/15 21:41, Carlo.Accorsi@ibs-ag.com a écrit :
> > Hi,
>
> hi,
>
> sorry for the delayed response ! I saw your mail 3 days ago, but I 
> forgot to answer back then, being caught in a storm of delaying events...
> >
> > We have one customer that wants to get replication implemented. 
> > We've
> set it up in house and it seems to work well.
> > The trouble is there are 10's of thousands of users and all sorts of
> concurrent activity. They're on M16 and a few times per month, we get 
> the
> 'ERR_554 double get for block' and then we need to restore.
> > We have an automated way of doing this now so it's not too 
> > disruptive
> when it happens.
> Hopefully. Still I understand this is not convenient.
> >
> > My question is, would adding replication just add more concurrent
> activity to the servers and lead more frequently to the ERR_554 situation?
> For the consumer, it will be just as if the updates were coming from a 
> standard user. So, no, I don't think it's a real problem.
> >
> > Also he's asked to upgrade to M20 but my hunch is that for this 
> > issue it
> won't matter much because we're still on JDBM?
>
> True. It's just that some bugs have been fixed in M20.
> >
> > Finally, we're eager to test out a release with the Mavibot backed.
> > I'm sure it's possible to build ourselves but we're trying to keep 
> > the
> moving parts to a minimum and use only ApacheDS releases.
> I understand.
>
> Mavibot work has been delayed, and we are working on it as much as we 
> can, but it's not enough. Kiran and I have day jobs atm that forbid us 
> to dedicate as much time as would be necessary to complete the last 
> part that is mandatory to get this problem fixed : the transaction 
> system. We restarted to 'think' about the best implementation 2 weeks 
> ago (on your free time) and we expect to be able to spend more time on 
> it next week, as we will both be at Budapest, during the Apache Conference.
>
> Transaction  support (the way we need it, ie cross-btree) is not that 
> simple to impelment. We have a pretty clear idea on how it should 
> work, but that mpacts the current design, as we need to woirk in 
> memory, and avoid to copy pages that have already been modifed in the 
> transaction boundaries. We also have to deal with the cleanup of old 
> versions, which also means we need to implement the version manager.
>
> To give you a insight on what I'm coming for, here is a draft of my 
> thoughts about this transaction manager (note that this is my vision, 
> that I haven't shared yet, because I don't think it's covering all the 
> moving parts, but still, I think it won't be that different to what we 
> will end with ) :
>
> Transaction support
> -------------------
>
> Mavibot must support transaction across many B-tres (ie, whatever the 
> number of b-trees we are updating, we should wait before committing 
> the changes up to the point the transaction is closed). That means we 
> may have many updates pending in memory. In any case, if a rollback is 
> done, the modified b-trees will remain intact, in the same state they 
> were before the transaction started.
>
> To achive this, we need to work in a working memory. As we may have to 
> fetch pages that are to be updated, we will face three cases :
> - the page is not in memory :
> we have to fetch them from disk, update them and store it in the 
> working memory we can read it back if needed
> - the page is in the cache :
> we have to copy it to the working memory
> - the page is in the working memory : we simply update it, leaving it 
> in the working memory.
>
> So if we have to modify the same page many times, it will be updated 
> in the working memory, we won't need to copy it.
>
> That actually change the current Mavibot implementation, as it's 
> copying pages no matter what. Here, if the page is coming from the 
> working memory, we just update it.
>
>
> Layout
> ------
>
>  Disk            Cache           Working
>   ___                            Memory
>  /   \          .------.
> +     +        /      /|         .------.
> |\___/|       .------. |       .------. |
> +     +       |      | |       |      | |
> |\___/| ----> |      | | ----> |      | |
> +     +       |      |/        |      |/
> |\___/|       .------.         .------.
> +     +
>  \___/
>
>  When the transaction is rollbacked, we simply delete the content of 
> the working memory.
>  When the transaction is committed, we write all the pages in the 
> working memory on disk, and into the cache, we update the header, and 
> we purge the working memory.
>
>  If a crash occurs during a transaction, then either the transaction 
> is rollbacked (if the process is still running) - which is just about 
> purging the working memory -, or there will be nothing to do if we 
> have to restart the application, as the modified pages were in memory 
> nad have been removed while the process crashed.
>
> OTOH, when we read, we must be sure that the B-trees are present, and 
> that we always use the B-tree revisions that are correlated. This is 
> critical when using LDAP, where many B-trees are updated during a sinlge write.
>
> The way to do that is to use a revision, and always use the B-trees 
> which revision is just below or equal to this revision. Let's see with 
> an example
>
>  +-----+---+---+---+---+---+---+---+...
>  |\ Rev|   |   |   |   |   |   |   |
>  | \   |   |   |   |   |   |   |   |
>  |  \  | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8
>  |B- \ |   |   |   |   |   |   |   |
>  |tree\|   |   |   |   |   |   | L | N
>  +-----+---+---+---+---+---+---+---+...
>  | aaa | X |   |   | X | X | X |   |
>  |     | 1 | 1 | 1 | 4 | 5 | 6 | 6 |
>  +-----+---+---+---+---+---+---+---+...
>  | bbb | X | X | X |   | X | X | X |
>  |     | 1 | 2 | 3 | 3 | 5 | 6 | 7 |
>  +-----+---+---+---+---+---+---+---+...
>  | ccc | X |   | X | X | X |   | X |
>  |     | 1 | 1 | 3 | 4 | 5 | 5 | 7 |
>  +-----+---+---+---+---+---+---+---+...
>  | ddd | X | X |   |   | X |   |   |
>  |     | 1 | 2 | 2 | 2 | 5 | 5 | 5 |
>  +-----+---+---+---+---+---+---+---+...
>  | eee | X |   | X | X |   | X |   |
>  |     | 1 | 1 | 3 | 4 | 4 | 6 | 6 |
>  +-----+---+---+---+---+---+---+---+...
>
> L : Latest revision
> N : New revision
>
> In this table, we have had 7 updates, and 7 revsions. As all the 
> B-trees haven't been updated, we may have some 'holes' in the revision 
> associated to a B-tree. Typically, here, the 'aaa' B-tree will have 4 
> revisions (1, 4,
> 5 and 6) created, because this B-tree has been updated
> 4 times.
>
> Doing a read on revision 7 will use the following B-trees : aaa[6], 
> bbb[7], ccc[7], ddd[5], eee[6].
>
>
> All in all, a read is associated to a global transaction revision 
> which is used as a marker. Every B-tree update is done using the new revision.
>
> Managing on-going reads
> -----------------------
>
> We maintain all the B-tree revisions associated with a read, until the 
> read is completed, we need to get rid of the old revision when we are done.
> This is critical to keep the database size to a minimum. In order to 
> do that, we must know if an older revision is still in use. We then 
> need to keep a common data structure that is shared across threads, 
> which will hold all the used revisions, ordered. With the previous 
> example, assuming revisions 2, 4 and 6 are in use, we may use a linked list for that purpose :
>
>    +---+   +---+   +---+
> o--| 2 |<--| 4 |<--| 6 |
>    +--2+   +--1+   +--3+
>
> When the read using rev 4 is done, we remove it from this list :
>           +-------------+
>    +---+  |    +---+    | +---+
> o--| 2 |<-+  X-| 4 |    +-| 6 |
>    +--2+       +--0+      +--3+
>
> The two remaining revisions are still in use, and one of them is older.
> We can clean the B-tree pages associated with revision 4 if there is 
> already a new revision for this B-tree, and for the pages that have 
> been copied.
>
> When the read using rev 2 is done, we can clean all the B-tree pages 
> that are associated with every revision below 6, as soon as there is a 
> B-tree revision above the oldest one.
>
> An other important point : each revision might be used more than once.
> Releasing a read will decrement the count, and when it goes down to 0, 
> we can safely remove the element from the list. We need a thread-safe 
> data structure for that.
>
> Managing the list
> -----------------
>
> let's go a bit deeper in the management of this list. It has 
> interesting characteristics :
>
> - elements are added by a unique thread : the writter
> - elements are not removed unless the count becomes 0
> - the count is equal to the maximum number of thread that can use the 
> element at some point
> - it's not because the counter is N that N thread *are* using this 
> element
> : we may eventually have no thread using it ! (but they *may* come 
> back
> later...)
> - each element must be be protected agains concurrent update (which is 
> a different requirement than protecting the full list)
> - any thread requiring an element in the list will always use the head 
> of the least (which is the most recent)
> - we don't do any list traversal ! Once a thread uses an element, it 
> remembers about its position
> - no user Thread is allowed to remove an element from this list. The 
> only thread that can do that is the Cleaner Thread
>
>
> The last characteristic is interesting : that means we can safely 
> depends on one single thread to do the cleaning, so there is no 
> concurrent access on teh list while deleting elements.
>
> The deletion can be done in two ways :
> - from the tail of the list, if the element's counter is equal to zero 
> (and we can iterate)
> - from the inner of the list, when the element's counter to be removed 
> is zero
>
> The first way is simpler, but it holds potentially a lot of element if 
> the oldest one is there for a long time The second way implies we can 
> relink elements in the list, ie this is a double-linked list
>
> Let's assume we go for the first way. In any case, the list will
> *always* have at least one element, even if its counter is zero :
>
>    head
>    +---+
>    | N | revision
>    +---+
>    | 0 | users counter (AtomicInteger)
>    +---+
>    |   |--o
>    +---+
> o--|   |
>    +---+
>
> A thread wanting to use revision N will simply increment the counter, 
> which is an AtomicInteger, guaranteing that we can't have a wrong 
> update of this value. The revision will never change.
> At this point, using the head is quite straight-forward.
>
> * Adding an element
>
> So the writter creates a new revision. This will create a new element :
>
>     new      head
>    +---+     +---+
>    |N+1|     | N |<<--- T1, T2, T5
>    +---+     +---+
>    | 0 |     | 3 |
>    +---+     +---+
>    |   |--o  |   |--o
>    +---+     +---+
> o--|   |  o--|   |
>    +---+     +---+
>
> The Head should always be pointing to one single cell, and any thread 
> should be able to get it. This is easy to do if we use an AtomicReference :
>
>     new      [head]
>    +---+     +---+
>    |N+1|     | N |<<--- T1, T2, T5
>    +---+     +---+
>    | 0 |     | 3 |
>    +---+     +---+
>    |   |--o  |   |--o
>    +---+     +---+
> o--|   |  o--|   |
>    +---+     +---+
>
> Swapping the head is done after we have updated the links, which is 
> done by the writter thread :
>
> Step 1 :
>
> Attach the head to the new element
>
>     new      [head]
>    +---+     +---+
>    |N+1|     | N |<<--- T1, T2, T5
>    +---+     +---+
>    | 0 |     | 3 |
>    +---+     +---+
>    |   |--o  |   |--o
>    +---+     +---+
> o--|   |<----|   |
>    +---+     +---+
>
> Step 2 :
>
> Attach the new element to the head (Here, at the same time, a new 
> thread needs the latest revison (still N))
>
>
>     new      [head]
>    +---+     +---+
>    |N+1|     | N |<<--- T1, T2, T5, T7
>    +---+     +---+
>    | 0 |     | 4 |
>    +---+     +---+
>    |   |---->|   |--o
>    +---+     +---+
> o--|   |<----|   |
>    +---+     +---+
>
> Step 3 :
>
> Swap the head
>
>    [head]
>    +---+     +---+
>    |N+1|     | N |<<--- T1, T2, T5, T7
>    +---+     +---+
>    | 0 |     | 4 |
>    +---+     +---+
>    |   |---->|   |--o
>    +---+     +---+
> o--|   |<----|   |
>    +---+     +---+
>
> Step 4 :
>
> A new thread need the latest revision
>
>
>            +--- T8
>    [head]  |
>    +---+   | +---+
>    |N+1|<<-+ | N |<<--- T1, T2, T5, T7
>    +---+     +---+
>    | 1 |     | 4 |
>    +---+     +---+
>    |   |---->|   |--o
>    +---+     +---+
> o--|   |<----|   |
>    +---+     +---+
>
> At this point, no other thread will *ever* be attached to revision N
>
> This can go on an on, with new revisions attached to the list by the 
> writter thread. Now, let's consider two different use case :
> 1) when the oldest revision has a counter going down to zero
> 2) when a revision which is not the head nor the oldest that is going 
> to zero
>
> First case, a first approach :
>
> This is the simplest. We have to remove the element from the lists, 
> and check if its parent's counter is also zero and is not the head. If 
> so, we iterate until we meet an element which counter is not zero and 
> is not the head.
>
> In any case, we simply have to change its parent's next pointer, and 
> to remove the element from the list :
>
>            +--- T8   +-- T1, T2, T5, T7
>    [head]  |         |
>    +---+   | +---+   | +---+     +---+
>    |N+1|<<-+ | N |<<-+ |N-1|     |N-2|x---- T9 This thread will release
> the element, setting teh counter to 0
>    +---+     +---+     +---+     +---+
>    | 1 |     | 4 |     | 0 |     | 0 |  Was 1
>    +---+     +---+     +---+     +---+
>    |   |---->|   |---->|   |---->|   |--o
>    +---+     +---+     +---+     +---+
> o--|   |<----|   |<----|   |<----|   |
>    +---+     +---+     +---+     +---+
>
> The thread that set the counter to 0
>
> Step 1 :
>
> Detach the last element parent's next pointer
>
>            +--- T8   +-- T1, T2, T5, T7
>    [head]  |         |
>    +---+   | +---+   | +---+     +---+
>    |N+1|<<-+ | N |<<-+ |N-1|     |N-2|x---- T9
>    +---+     +---+     +---+     +---+
>    | 1 |     | 4 |     | 0 |     | 0 |
>    +---+     +---+     +---+     +---+
>    |   |---->|   |---->|   |--o  |   |--o
>    +---+     +---+     +---+     +---+
> o--|   |<----|   |<----|   |<----|   |
>    +---+     +---+     +---+     +---+
>
> Step 2 :
> Iterate if the parent's counter is 0
>
>            +--- T8   +-- T1, T2, T5, T7
>    [head]  |         |
>    +---+   | +---+   | +---+     +---+
>    |N+1|<<-+ | N |<<-+ |N-1|     |N-2|x---- T9
>    +---+     +---+     +---+     +---+
>    | 1 |     | 4 |     | 0 |     | 0 |
>    +---+     +---+     +---+     +---+
>    |   |---->|   |--o  |   |--o  |   |--o
>    +---+     +---+     +---+     +---+
> o--|   |<----|   |<----|   |<----|   |
>    +---+     +---+     +---+     +---+
>
> Step 3 :
> Detach the last element which counter is 0 prev pointer
>
>            +--- T8   +-- T1, T2, T5, T7
>    [head]  |         |
>    +---+   | +---+   |     +---+     +---+
>    |N+1|<<-+ | N |<<-+     |N-1|     |N-2|x---- T9
>    +---+     +---+         +---+     +---+
>    | 1 |     | 4 |         | 0 |     | 0 |
>    +---+     +---+         +---+     +---+
>    |   |---->|   |--o      |   |--o  |   |--o
>    +---+     +---+         +---+     +---+
> o--|   |<----|   |      o--|   |<----|   |
>    +---+     +---+         +---+     +---+
>
> Step 4 :
>
>  We can now dispose the unlinked elements.
>
> What if a thread is freeing an element while we are already free 
> another one ? This can be exposed using the previous sample :
>
>            +--- T8   +-- T1, T2, T5, T7
>    [head]  |         |
>    +---+   | +---+   | +---+     +---+
>    |N+1|<<-+ | N |<<-+ |N-1|     |N-2|x---- T9 This thread will release
> the element, setting teh counter to 0
>    +---+     +---+     +---+     +---+
>    | 1 |     | 4 |     | 1 |     | 0 |  Was 1
>    +---+     +---+     +---+     +---+
>    |   |---->|   |---->|   |---->|   |--o
>    +---+     +---+     +---+     +---+
> o--|   |<----|   |<----|   |<----|   |
>    +---+     +---+     +---+     +---+
>
> Step 0+
>
> A thread (10) release the N-1 revison. We have two possibilities here :
>
> 1) The N-2 parent's next is not yet detached
>
>
>            +--- T8   +-- T1, T2, T5, T7
>    [head]  |         |
>    +---+   | +---+   | +---+     +---+
>    |N+1|<<-+ | N |<<-+ |N-1|     |N-2|x---- T9 This thread will release
> the element, setting teh counter to 0
>    +---+     +---+     +---+     +---+
>    | 1 |     | 4 |     | 0 |     | 0 |  Was 1
>    +---+     +---+     +---+     +---+
>    |   |---->|   |---->|   |---->|   |--o
>    +---+     +---+     +---+     +---+
> o--|   |<----|   |<----|   |<----|   |
>    +---+     +---+     +---+     +---+
>
> We just continue walking the list up to the first element with a non 
> zero counter
>
> 2) The N-2 paren'ts next is already detached
>
>
>
>            +--- T8   +-- T1, T2, T5, T7
>    [head]  |         |         +-- T10
>    +---+   | +---+   | +---+   | +---+
>    |N+1|<<-+ | N |<<-+ |N-1|x--+ |N-2|x---- T9 This thread will 
> release the element, setting teh counter to 0
>    +---+     +---+     +---+     +---+
>    | 1 |     | 4 |     | 0 |     | 0 |  Was 1
>    +---+     +---+     +---+     +---+
>    |   |---->|   |---->|   |--o  |   |--o
>    +---+     +---+     +---+     +---+
> o--|   |<----|   |<----|   |<----|   |
>    +---+     +---+     +---+     +---+
>
> In this case, we will have 2 threads dealing with two elements to be 
> cleaned up. This is not a problem, as T9 will now detach the N-2 
> element from teh N-1 one and T9 will detach the N-1 element from the 
> list. All is safe.
>
> About the removed elements
> --------------------------
>
> Once a thread has detached some elements which counters are equal to 
> 0, what should we do with them ? As we will update the disk with the 
> removed pages, we have to send them to teh Writter thread. This can be 
> done by pushing teminto a deque (and more specifically a
> ConcurrentLinkedDeque)
>
> Another option is that the writer thread handle the cleaning of this 
> queue when it is weaken up or periodically. In this case, we don't 
> need to push the element sinto a deque, and the reader threads will 
> just set the counter to 0.
>
> Second case
> -----------
>
> What if we accept the removal of elements from the middle of the list ?
> This is way more complex, as we have to be sure we don't break the chain.
> One way to get rid of this constraint is to enforce the writer thread 
> to do the job, by periodically checing the whole list (which shuld not 
> be too costly). In this case, the reader thread never update the 
> element's parent's next pointer, they just decrement teh counter.
> When this counter becomes 0, then we can remove the element.
>
> Solution
> ========
>
> We will separate the threads into 2 categories :
> - readers (we may have many)
> - writer (we have only one)
>
> The readers always peek the latest revision (which is the head of the 
> list). It then increment the counter. When teh thread does not need 
> the revision anymore, the counter is decremented. This is the only 
> action the reader thread will ever do on the elements and on the list, 
> but it will always keep a reference to the element into the session.
>
> The writer thread has a lot more to do :
> - it is responsible for adding element at the top of the list (see upper).
> Once the head is swapped, every new reader will get the added element 
> as a reference.
> - it also has to check in the list for the elements which counter is 
> 0, if teh list is longer than 1 element. This check can be done after 
> each update, or periodically (every N seconds if the number of 
> elements is > 1, or after M additions) if tht help saving updates on disk.
>
> Referencing pages
> -----------------
>
> All the B-tree revisions are kept into the BoB B-tree, so it's easy to 
> retrieve the rootPage for a given revision.
>
> As all the pages can't be hold in memory, we sometime have to reach a 
> page from disk or from the cache. This is done following this simple algorithm :
>
>   for a given page with offset O
>     if the cache contains an instance for O
>       return the page
>     else
>       fetch the page with offset O from disk
>       store it in the cache
>
> In our case, we just add a layer, which will check in the working 
> memory for the instance of a page, given its offset.
>
>
>


--
Ashma Shrestha
PHP Developer
(770) 310-9456
ashma.shrestha@gmail.com

Re: ApacheDS with Mavibot anytime soon?

Posted by Ashma Shrestha <as...@gmail.com>.

Hi Carlos,

You have mentioned "we get the 'ERR_554 double get for block' and then we
need to restore. We have an automated way of doing this now so it's not too
disruptive when it happens." . I am running into the same ERR_554 issue,
can you let me know how you fixed it?

Thanks,

Ashma

On Mon, Sep 28, 2015 at 9:15 AM, <Ca...@ibs-ag.com> wrote:

> Emmanuel! Wow!! Thank you for that most detailed and comprehensive write
> up of the concurrency problem you are working to solve with Mavibot.  The
> design sounds world class.
> I now fully appreciate the complexity of this transaction issue and the
> ASCII art helped make it clear. We realize you're all doing your
> best to devise a robust solution for this and so we'll sit tight for now.
> I appreciate the time you took to explain this to us. Have fun at the
> conference.
> Best, Carlo Accorsi
>
>
>
>
> -----Original Message-----
> From: Emmanuel Lécharny [mailto:elecharny@gmail.com]
> Sent: Sunday, September 27, 2015 1:23 PM
> To: users@directory.apache.org
> Subject: Re: ApacheDS with Mavibot anytime soon?
>
> Le 23/09/15 21:41, Carlo.Accorsi@ibs-ag.com a écrit :
> > Hi,
>
> hi,
>
> sorry for the delayed response ! I saw your mail 3 days ago, but I forgot
> to answer back then, being caught in a storm of delaying events...
> >
> > We have one customer that wants to get replication implemented. We've
> set it up in house and it seems to work well.
> > The trouble is there are 10's of thousands of users and all sorts of
> concurrent activity. They're on M16 and a few times per month, we get the
> 'ERR_554 double get for block' and then we need to restore.
> > We have an automated way of doing this now so it's not too disruptive
> when it happens.
> Hopefully. Still I understand this is not convenient.
> >
> > My question is, would adding replication just add more concurrent
> activity to the servers and lead more frequently to the ERR_554 situation?
> For the consumer, it will be just as if the updates were coming from a
> standard user. So, no, I don't think it's a real problem.
> >
> > Also he's asked to upgrade to M20 but my hunch is that for this issue it
> won't matter much because we're still on JDBM?
>
> True. It's just that some bugs have been fixed in M20.
> >
> > Finally, we're eager to test out a release with the Mavibot backed.
> > I'm sure it's possible to build ourselves but we're trying to keep the
> moving parts to a minimum and use only ApacheDS releases.
> I understand.
>
> Mavibot work has been delayed, and we are working on it as much as we can,
> but it's not enough. Kiran and I have day jobs atm that forbid us to
> dedicate as much time as would be necessary to complete the last part that
> is mandatory to get this problem fixed : the transaction system. We
> restarted to 'think' about the best implementation 2 weeks ago (on your
> free time) and we expect to be able to spend more time on it next week, as
> we will both be at Budapest, during the Apache Conference.
>
> Transaction  support (the way we need it, ie cross-btree) is not that
> simple to impelment. We have a pretty clear idea on how it should work, but
> that mpacts the current design, as we need to woirk in memory, and avoid to
> copy pages that have already been modifed in the transaction boundaries. We
> also have to deal with the cleanup of old versions, which also means we
> need to implement the version manager.
>
> To give you a insight on what I'm coming for, here is a draft of my
> thoughts about this transaction manager (note that this is my vision, that
> I haven't shared yet, because I don't think it's covering all the moving
> parts, but still, I think it won't be that different to what we will end
> with ) :
>
> Transaction support
> -------------------
>
> Mavibot must support transaction across many B-tres (ie, whatever the
> number of b-trees we are updating, we should wait before committing the
> changes up to the point the transaction is closed). That means we may have
> many updates pending in memory. In any case, if a rollback is done, the
> modified b-trees will remain intact, in the same state they were before the
> transaction started.
>
> To achive this, we need to work in a working memory. As we may have to
> fetch pages that are to be updated, we will face three cases :
> - the page is not in memory :
> we have to fetch them from disk, update them and store it in the working
> memory we can read it back if needed
> - the page is in the cache :
> we have to copy it to the working memory
> - the page is in the working memory : we simply update it, leaving it in
> the working memory.
>
> So if we have to modify the same page many times, it will be updated in
> the working memory, we won't need to copy it.
>
> That actually change the current Mavibot implementation, as it's copying
> pages no matter what. Here, if the page is coming from the working memory,
> we just update it.
>
>
> Layout
> ------
>
>  Disk            Cache           Working
>   ___                            Memory
>  /   \          .------.
> +     +        /      /|         .------.
> |\___/|       .------. |       .------. |
> +     +       |      | |       |      | |
> |\___/| ----> |      | | ----> |      | |
> +     +       |      |/        |      |/
> |\___/|       .------.         .------.
> +     +
>  \___/
>
>  When the transaction is rollbacked, we simply delete the content of the
> working memory.
>  When the transaction is committed, we write all the pages in the working
> memory on disk, and into the cache, we update the header, and we purge the
> working memory.
>
>  If a crash occurs during a transaction, then either the transaction is
> rollbacked (if the process is still running) - which is just about purging
> the working memory -, or there will be nothing to do if we have to restart
> the application, as the modified pages were in memory nad have been removed
> while the process crashed.
>
> OTOH, when we read, we must be sure that the B-trees are present, and that
> we always use the B-tree revisions that are correlated. This is critical
> when using LDAP, where many B-trees are updated during a sinlge write.
>
> The way to do that is to use a revision, and always use the B-trees which
> revision is just below or equal to this revision. Let's see with an example
>
>  +-----+---+---+---+---+---+---+---+...
>  |\ Rev|   |   |   |   |   |   |   |
>  | \   |   |   |   |   |   |   |   |
>  |  \  | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8
>  |B- \ |   |   |   |   |   |   |   |
>  |tree\|   |   |   |   |   |   | L | N
>  +-----+---+---+---+---+---+---+---+...
>  | aaa | X |   |   | X | X | X |   |
>  |     | 1 | 1 | 1 | 4 | 5 | 6 | 6 |
>  +-----+---+---+---+---+---+---+---+...
>  | bbb | X | X | X |   | X | X | X |
>  |     | 1 | 2 | 3 | 3 | 5 | 6 | 7 |
>  +-----+---+---+---+---+---+---+---+...
>  | ccc | X |   | X | X | X |   | X |
>  |     | 1 | 1 | 3 | 4 | 5 | 5 | 7 |
>  +-----+---+---+---+---+---+---+---+...
>  | ddd | X | X |   |   | X |   |   |
>  |     | 1 | 2 | 2 | 2 | 5 | 5 | 5 |
>  +-----+---+---+---+---+---+---+---+...
>  | eee | X |   | X | X |   | X |   |
>  |     | 1 | 1 | 3 | 4 | 4 | 6 | 6 |
>  +-----+---+---+---+---+---+---+---+...
>
> L : Latest revision
> N : New revision
>
> In this table, we have had 7 updates, and 7 revsions. As all the B-trees
> haven't been updated, we may have some 'holes' in the revision associated
> to a B-tree. Typically, here, the 'aaa' B-tree will have 4 revisions (1, 4,
> 5 and 6) created, because this B-tree has been updated
> 4 times.
>
> Doing a read on revision 7 will use the following B-trees : aaa[6],
> bbb[7], ccc[7], ddd[5], eee[6].
>
>
> All in all, a read is associated to a global transaction revision which is
> used as a marker. Every B-tree update is done using the new revision.
>
> Managing on-going reads
> -----------------------
>
> We maintain all the B-tree revisions associated with a read, until the
> read is completed, we need to get rid of the old revision when we are done.
> This is critical to keep the database size to a minimum. In order to do
> that, we must know if an older revision is still in use. We then need to
> keep a common data structure that is shared across threads, which will hold
> all the used revisions, ordered. With the previous example, assuming
> revisions 2, 4 and 6 are in use, we may use a linked list for that purpose :
>
>    +---+   +---+   +---+
> o--| 2 |<--| 4 |<--| 6 |
>    +--2+   +--1+   +--3+
>
> When the read using rev 4 is done, we remove it from this list :
>           +-------------+
>    +---+  |    +---+    | +---+
> o--| 2 |<-+  X-| 4 |    +-| 6 |
>    +--2+       +--0+      +--3+
>
> The two remaining revisions are still in use, and one of them is older.
> We can clean the B-tree pages associated with revision 4 if there is
> already a new revision for this B-tree, and for the pages that have been
> copied.
>
> When the read using rev 2 is done, we can clean all the B-tree pages that
> are associated with every revision below 6, as soon as there is a B-tree
> revision above the oldest one.
>
> An other important point : each revision might be used more than once.
> Releasing a read will decrement the count, and when it goes down to 0, we
> can safely remove the element from the list. We need a thread-safe data
> structure for that.
>
> Managing the list
> -----------------
>
> let's go a bit deeper in the management of this list. It has interesting
> characteristics :
>
> - elements are added by a unique thread : the writter
> - elements are not removed unless the count becomes 0
> - the count is equal to the maximum number of thread that can use the
> element at some point
> - it's not because the counter is N that N thread *are* using this element
> : we may eventually have no thread using it ! (but they *may* come back
> later...)
> - each element must be be protected agains concurrent update (which is a
> different requirement than protecting the full list)
> - any thread requiring an element in the list will always use the head of
> the least (which is the most recent)
> - we don't do any list traversal ! Once a thread uses an element, it
> remembers about its position
> - no user Thread is allowed to remove an element from this list. The only
> thread that can do that is the Cleaner Thread
>
>
> The last characteristic is interesting : that means we can safely depends
> on one single thread to do the cleaning, so there is no concurrent access
> on teh list while deleting elements.
>
> The deletion can be done in two ways :
> - from the tail of the list, if the element's counter is equal to zero
> (and we can iterate)
> - from the inner of the list, when the element's counter to be removed is
> zero
>
> The first way is simpler, but it holds potentially a lot of element if the
> oldest one is there for a long time The second way implies we can relink
> elements in the list, ie this is a double-linked list
>
> Let's assume we go for the first way. In any case, the list will
> *always* have at least one element, even if its counter is zero :
>
>    head
>    +---+
>    | N | revision
>    +---+
>    | 0 | users counter (AtomicInteger)
>    +---+
>    |   |--o
>    +---+
> o--|   |
>    +---+
>
> A thread wanting to use revision N will simply increment the counter,
> which is an AtomicInteger, guaranteing that we can't have a wrong update of
> this value. The revision will never change.
> At this point, using the head is quite straight-forward.
>
> * Adding an element
>
> So the writter creates a new revision. This will create a new element :
>
>     new      head
>    +---+     +---+
>    |N+1|     | N |<<--- T1, T2, T5
>    +---+     +---+
>    | 0 |     | 3 |
>    +---+     +---+
>    |   |--o  |   |--o
>    +---+     +---+
> o--|   |  o--|   |
>    +---+     +---+
>
> The Head should always be pointing to one single cell, and any thread
> should be able to get it. This is easy to do if we use an AtomicReference :
>
>     new      [head]
>    +---+     +---+
>    |N+1|     | N |<<--- T1, T2, T5
>    +---+     +---+
>    | 0 |     | 3 |
>    +---+     +---+
>    |   |--o  |   |--o
>    +---+     +---+
> o--|   |  o--|   |
>    +---+     +---+
>
> Swapping the head is done after we have updated the links, which is done
> by the writter thread :
>
> Step 1 :
>
> Attach the head to the new element
>
>     new      [head]
>    +---+     +---+
>    |N+1|     | N |<<--- T1, T2, T5
>    +---+     +---+
>    | 0 |     | 3 |
>    +---+     +---+
>    |   |--o  |   |--o
>    +---+     +---+
> o--|   |<----|   |
>    +---+     +---+
>
> Step 2 :
>
> Attach the new element to the head (Here, at the same time, a new thread
> needs the latest revison (still N))
>
>
>     new      [head]
>    +---+     +---+
>    |N+1|     | N |<<--- T1, T2, T5, T7
>    +---+     +---+
>    | 0 |     | 4 |
>    +---+     +---+
>    |   |---->|   |--o
>    +---+     +---+
> o--|   |<----|   |
>    +---+     +---+
>
> Step 3 :
>
> Swap the head
>
>    [head]
>    +---+     +---+
>    |N+1|     | N |<<--- T1, T2, T5, T7
>    +---+     +---+
>    | 0 |     | 4 |
>    +---+     +---+
>    |   |---->|   |--o
>    +---+     +---+
> o--|   |<----|   |
>    +---+     +---+
>
> Step 4 :
>
> A new thread need the latest revision
>
>
>            +--- T8
>    [head]  |
>    +---+   | +---+
>    |N+1|<<-+ | N |<<--- T1, T2, T5, T7
>    +---+     +---+
>    | 1 |     | 4 |
>    +---+     +---+
>    |   |---->|   |--o
>    +---+     +---+
> o--|   |<----|   |
>    +---+     +---+
>
> At this point, no other thread will *ever* be attached to revision N
>
> This can go on an on, with new revisions attached to the list by the
> writter thread. Now, let's consider two different use case :
> 1) when the oldest revision has a counter going down to zero
> 2) when a revision which is not the head nor the oldest that is going to
> zero
>
> First case, a first approach :
>
> This is the simplest. We have to remove the element from the lists, and
> check if its parent's counter is also zero and is not the head. If so, we
> iterate until we meet an element which counter is not zero and is not the
> head.
>
> In any case, we simply have to change its parent's next pointer, and to
> remove the element from the list :
>
>            +--- T8   +-- T1, T2, T5, T7
>    [head]  |         |
>    +---+   | +---+   | +---+     +---+
>    |N+1|<<-+ | N |<<-+ |N-1|     |N-2|x---- T9 This thread will release
> the element, setting teh counter to 0
>    +---+     +---+     +---+     +---+
>    | 1 |     | 4 |     | 0 |     | 0 |  Was 1
>    +---+     +---+     +---+     +---+
>    |   |---->|   |---->|   |---->|   |--o
>    +---+     +---+     +---+     +---+
> o--|   |<----|   |<----|   |<----|   |
>    +---+     +---+     +---+     +---+
>
> The thread that set the counter to 0
>
> Step 1 :
>
> Detach the last element parent's next pointer
>
>            +--- T8   +-- T1, T2, T5, T7
>    [head]  |         |
>    +---+   | +---+   | +---+     +---+
>    |N+1|<<-+ | N |<<-+ |N-1|     |N-2|x---- T9
>    +---+     +---+     +---+     +---+
>    | 1 |     | 4 |     | 0 |     | 0 |
>    +---+     +---+     +---+     +---+
>    |   |---->|   |---->|   |--o  |   |--o
>    +---+     +---+     +---+     +---+
> o--|   |<----|   |<----|   |<----|   |
>    +---+     +---+     +---+     +---+
>
> Step 2 :
> Iterate if the parent's counter is 0
>
>            +--- T8   +-- T1, T2, T5, T7
>    [head]  |         |
>    +---+   | +---+   | +---+     +---+
>    |N+1|<<-+ | N |<<-+ |N-1|     |N-2|x---- T9
>    +---+     +---+     +---+     +---+
>    | 1 |     | 4 |     | 0 |     | 0 |
>    +---+     +---+     +---+     +---+
>    |   |---->|   |--o  |   |--o  |   |--o
>    +---+     +---+     +---+     +---+
> o--|   |<----|   |<----|   |<----|   |
>    +---+     +---+     +---+     +---+
>
> Step 3 :
> Detach the last element which counter is 0 prev pointer
>
>            +--- T8   +-- T1, T2, T5, T7
>    [head]  |         |
>    +---+   | +---+   |     +---+     +---+
>    |N+1|<<-+ | N |<<-+     |N-1|     |N-2|x---- T9
>    +---+     +---+         +---+     +---+
>    | 1 |     | 4 |         | 0 |     | 0 |
>    +---+     +---+         +---+     +---+
>    |   |---->|   |--o      |   |--o  |   |--o
>    +---+     +---+         +---+     +---+
> o--|   |<----|   |      o--|   |<----|   |
>    +---+     +---+         +---+     +---+
>
> Step 4 :
>
>  We can now dispose the unlinked elements.
>
> What if a thread is freeing an element while we are already free another
> one ? This can be exposed using the previous sample :
>
>            +--- T8   +-- T1, T2, T5, T7
>    [head]  |         |
>    +---+   | +---+   | +---+     +---+
>    |N+1|<<-+ | N |<<-+ |N-1|     |N-2|x---- T9 This thread will release
> the element, setting teh counter to 0
>    +---+     +---+     +---+     +---+
>    | 1 |     | 4 |     | 1 |     | 0 |  Was 1
>    +---+     +---+     +---+     +---+
>    |   |---->|   |---->|   |---->|   |--o
>    +---+     +---+     +---+     +---+
> o--|   |<----|   |<----|   |<----|   |
>    +---+     +---+     +---+     +---+
>
> Step 0+
>
> A thread (10) release the N-1 revison. We have two possibilities here :
>
> 1) The N-2 parent's next is not yet detached
>
>
>            +--- T8   +-- T1, T2, T5, T7
>    [head]  |         |
>    +---+   | +---+   | +---+     +---+
>    |N+1|<<-+ | N |<<-+ |N-1|     |N-2|x---- T9 This thread will release
> the element, setting teh counter to 0
>    +---+     +---+     +---+     +---+
>    | 1 |     | 4 |     | 0 |     | 0 |  Was 1
>    +---+     +---+     +---+     +---+
>    |   |---->|   |---->|   |---->|   |--o
>    +---+     +---+     +---+     +---+
> o--|   |<----|   |<----|   |<----|   |
>    +---+     +---+     +---+     +---+
>
> We just continue walking the list up to the first element with a non zero
> counter
>
> 2) The N-2 paren'ts next is already detached
>
>
>
>            +--- T8   +-- T1, T2, T5, T7
>    [head]  |         |         +-- T10
>    +---+   | +---+   | +---+   | +---+
>    |N+1|<<-+ | N |<<-+ |N-1|x--+ |N-2|x---- T9 This thread will release
> the element, setting teh counter to 0
>    +---+     +---+     +---+     +---+
>    | 1 |     | 4 |     | 0 |     | 0 |  Was 1
>    +---+     +---+     +---+     +---+
>    |   |---->|   |---->|   |--o  |   |--o
>    +---+     +---+     +---+     +---+
> o--|   |<----|   |<----|   |<----|   |
>    +---+     +---+     +---+     +---+
>
> In this case, we will have 2 threads dealing with two elements to be
> cleaned up. This is not a problem, as T9 will now detach the N-2 element
> from teh N-1 one and T9 will detach the N-1 element from the list. All is
> safe.
>
> About the removed elements
> --------------------------
>
> Once a thread has detached some elements which counters are equal to 0,
> what should we do with them ? As we will update the disk with the removed
> pages, we have to send them to teh Writter thread. This can be done by
> pushing teminto a deque (and more specifically a
> ConcurrentLinkedDeque)
>
> Another option is that the writer thread handle the cleaning of this queue
> when it is weaken up or periodically. In this case, we don't need to push
> the element sinto a deque, and the reader threads will just set the counter
> to 0.
>
> Second case
> -----------
>
> What if we accept the removal of elements from the middle of the list ?
> This is way more complex, as we have to be sure we don't break the chain.
> One way to get rid of this constraint is to enforce the writer thread to do
> the job, by periodically checing the whole list (which shuld not be too
> costly). In this case, the reader thread never update the element's
> parent's next pointer, they just decrement teh counter.
> When this counter becomes 0, then we can remove the element.
>
> Solution
> ========
>
> We will separate the threads into 2 categories :
> - readers (we may have many)
> - writer (we have only one)
>
> The readers always peek the latest revision (which is the head of the
> list). It then increment the counter. When teh thread does not need the
> revision anymore, the counter is decremented. This is the only action the
> reader thread will ever do on the elements and on the list, but it will
> always keep a reference to the element into the session.
>
> The writer thread has a lot more to do :
> - it is responsible for adding element at the top of the list (see upper).
> Once the head is swapped, every new reader will get the added element as a
> reference.
> - it also has to check in the list for the elements which counter is 0, if
> teh list is longer than 1 element. This check can be done after each
> update, or periodically (every N seconds if the number of elements is > 1,
> or after M additions) if tht help saving updates on disk.
>
> Referencing pages
> -----------------
>
> All the B-tree revisions are kept into the BoB B-tree, so it's easy to
> retrieve the rootPage for a given revision.
>
> As all the pages can't be hold in memory, we sometime have to reach a page
> from disk or from the cache. This is done following this simple algorithm :
>
>   for a given page with offset O
>     if the cache contains an instance for O
>       return the page
>     else
>       fetch the page with offset O from disk
>       store it in the cache
>
> In our case, we just add a layer, which will check in the working memory
> for the instance of a page, given its offset.
>
>
>


-- 
Ashma Shrestha
PHP Developer
(770) 310-9456
ashma.shrestha@gmail.com

RE: ApacheDS with Mavibot anytime soon?

Posted by Ca...@ibs-ag.com.

Emmanuel! Wow!! Thank you for that most detailed and comprehensive write up of the concurrency problem you are working to solve with Mavibot.  The design sounds world class.
I now fully appreciate the complexity of this transaction issue and the ASCII art helped make it clear. We realize you're all doing your 
best to devise a robust solution for this and so we'll sit tight for now. I appreciate the time you took to explain this to us. Have fun at the conference. 
Best, Carlo Accorsi




-----Original Message-----
From: Emmanuel Lécharny [mailto:elecharny@gmail.com] 
Sent: Sunday, September 27, 2015 1:23 PM
To: users@directory.apache.org
Subject: Re: ApacheDS with Mavibot anytime soon?

Le 23/09/15 21:41, Carlo.Accorsi@ibs-ag.com a écrit :
> Hi,

hi,

sorry for the delayed response ! I saw your mail 3 days ago, but I forgot to answer back then, being caught in a storm of delaying events...
>
> We have one customer that wants to get replication implemented. We've set it up in house and it seems to work well.
> The trouble is there are 10's of thousands of users and all sorts of concurrent activity. They're on M16 and a few times per month, we get the 'ERR_554 double get for block' and then we need to restore.
> We have an automated way of doing this now so it's not too disruptive when it happens.
Hopefully. Still I understand this is not convenient.
>
> My question is, would adding replication just add more concurrent activity to the servers and lead more frequently to the ERR_554 situation?
For the consumer, it will be just as if the updates were coming from a standard user. So, no, I don't think it's a real problem.
>
> Also he's asked to upgrade to M20 but my hunch is that for this issue it won't matter much because we're still on JDBM?

True. It's just that some bugs have been fixed in M20.
>
> Finally, we're eager to test out a release with the Mavibot backed. 
> I'm sure it's possible to build ourselves but we're trying to keep the moving parts to a minimum and use only ApacheDS releases.
I understand.

Mavibot work has been delayed, and we are working on it as much as we can, but it's not enough. Kiran and I have day jobs atm that forbid us to dedicate as much time as would be necessary to complete the last part that is mandatory to get this problem fixed : the transaction system. We restarted to 'think' about the best implementation 2 weeks ago (on your free time) and we expect to be able to spend more time on it next week, as we will both be at Budapest, during the Apache Conference.

Transaction  support (the way we need it, ie cross-btree) is not that simple to impelment. We have a pretty clear idea on how it should work, but that mpacts the current design, as we need to woirk in memory, and avoid to copy pages that have already been modifed in the transaction boundaries. We also have to deal with the cleanup of old versions, which also means we need to implement the version manager.

To give you a insight on what I'm coming for, here is a draft of my thoughts about this transaction manager (note that this is my vision, that I haven't shared yet, because I don't think it's covering all the moving parts, but still, I think it won't be that different to what we will end with ) :

Transaction support
-------------------

Mavibot must support transaction across many B-tres (ie, whatever the number of b-trees we are updating, we should wait before committing the changes up to the point the transaction is closed). That means we may have many updates pending in memory. In any case, if a rollback is done, the modified b-trees will remain intact, in the same state they were before the transaction started.

To achive this, we need to work in a working memory. As we may have to fetch pages that are to be updated, we will face three cases :
- the page is not in memory :
we have to fetch them from disk, update them and store it in the working memory we can read it back if needed
- the page is in the cache :
we have to copy it to the working memory
- the page is in the working memory : we simply update it, leaving it in the working memory.

So if we have to modify the same page many times, it will be updated in the working memory, we won't need to copy it.

That actually change the current Mavibot implementation, as it's copying pages no matter what. Here, if the page is coming from the working memory, we just update it.


Layout
------

 Disk            Cache           Working
  ___                            Memory
 /   \          .------.       
+     +        /      /|         .------.
|\___/|       .------. |       .------. |
+     +       |      | |       |      | |
|\___/| ----> |      | | ----> |      | |
+     +       |      |/        |      |/
|\___/|       .------.         .------.
+     +
 \___/
 
 When the transaction is rollbacked, we simply delete the content of the working memory.
 When the transaction is committed, we write all the pages in the working memory on disk, and into the cache, we update the header, and we purge the working memory.
 
 If a crash occurs during a transaction, then either the transaction is rollbacked (if the process is still running) - which is just about purging the working memory -, or there will be nothing to do if we have to restart the application, as the modified pages were in memory nad have been removed while the process crashed.

OTOH, when we read, we must be sure that the B-trees are present, and that we always use the B-tree revisions that are correlated. This is critical when using LDAP, where many B-trees are updated during a sinlge write.

The way to do that is to use a revision, and always use the B-trees which revision is just below or equal to this revision. Let's see with an example

 +-----+---+---+---+---+---+---+---+...
 |\ Rev|   |   |   |   |   |   |   |
 | \   |   |   |   |   |   |   |   |
 |  \  | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8
 |B- \ |   |   |   |   |   |   |   |
 |tree\|   |   |   |   |   |   | L | N
 +-----+---+---+---+---+---+---+---+...
 | aaa | X |   |   | X | X | X |   |
 |     | 1 | 1 | 1 | 4 | 5 | 6 | 6 |
 +-----+---+---+---+---+---+---+---+...
 | bbb | X | X | X |   | X | X | X |
 |     | 1 | 2 | 3 | 3 | 5 | 6 | 7 |
 +-----+---+---+---+---+---+---+---+...
 | ccc | X |   | X | X | X |   | X |
 |     | 1 | 1 | 3 | 4 | 5 | 5 | 7 |
 +-----+---+---+---+---+---+---+---+...
 | ddd | X | X |   |   | X |   |   |
 |     | 1 | 2 | 2 | 2 | 5 | 5 | 5 |
 +-----+---+---+---+---+---+---+---+...
 | eee | X |   | X | X |   | X |   |
 |     | 1 | 1 | 3 | 4 | 4 | 6 | 6 |
 +-----+---+---+---+---+---+---+---+...
 
L : Latest revision
N : New revision
 
In this table, we have had 7 updates, and 7 revsions. As all the B-trees haven't been updated, we may have some 'holes' in the revision associated to a B-tree. Typically, here, the 'aaa' B-tree will have 4 revisions (1, 4, 5 and 6) created, because this B-tree has been updated
4 times.
 
Doing a read on revision 7 will use the following B-trees : aaa[6], bbb[7], ccc[7], ddd[5], eee[6].


All in all, a read is associated to a global transaction revision which is used as a marker. Every B-tree update is done using the new revision.

Managing on-going reads
-----------------------

We maintain all the B-tree revisions associated with a read, until the read is completed, we need to get rid of the old revision when we are done. This is critical to keep the database size to a minimum. In order to do that, we must know if an older revision is still in use. We then need to keep a common data structure that is shared across threads, which will hold all the used revisions, ordered. With the previous example, assuming revisions 2, 4 and 6 are in use, we may use a linked list for that purpose :

   +---+   +---+   +---+
o--| 2 |<--| 4 |<--| 6 |
   +--2+   +--1+   +--3+
  
When the read using rev 4 is done, we remove it from this list :
          +-------------+
   +---+  |    +---+    | +---+
o--| 2 |<-+  X-| 4 |    +-| 6 |
   +--2+       +--0+      +--3+
                
The two remaining revisions are still in use, and one of them is older.
We can clean the B-tree pages associated with revision 4 if there is already a new revision for this B-tree, and for the pages that have been copied.

When the read using rev 2 is done, we can clean all the B-tree pages that are associated with every revision below 6, as soon as there is a B-tree revision above the oldest one.

An other important point : each revision might be used more than once.
Releasing a read will decrement the count, and when it goes down to 0, we can safely remove the element from the list. We need a thread-safe data structure for that.

Managing the list
-----------------

let's go a bit deeper in the management of this list. It has interesting characteristics :

- elements are added by a unique thread : the writter
- elements are not removed unless the count becomes 0
- the count is equal to the maximum number of thread that can use the element at some point
- it's not because the counter is N that N thread *are* using this element : we may eventually have no thread using it ! (but they *may* come back later...)
- each element must be be protected agains concurrent update (which is a different requirement than protecting the full list)
- any thread requiring an element in the list will always use the head of the least (which is the most recent)
- we don't do any list traversal ! Once a thread uses an element, it remembers about its position
- no user Thread is allowed to remove an element from this list. The only thread that can do that is the Cleaner Thread


The last characteristic is interesting : that means we can safely depends on one single thread to do the cleaning, so there is no concurrent access on teh list while deleting elements.

The deletion can be done in two ways :
- from the tail of the list, if the element's counter is equal to zero (and we can iterate)
- from the inner of the list, when the element's counter to be removed is zero

The first way is simpler, but it holds potentially a lot of element if the oldest one is there for a long time The second way implies we can relink elements in the list, ie this is a double-linked list

Let's assume we go for the first way. In any case, the list will
*always* have at least one element, even if its counter is zero :

   head
   +---+
   | N | revision
   +---+
   | 0 | users counter (AtomicInteger)
   +---+
   |   |--o
   +---+
o--|   |
   +---+
  
A thread wanting to use revision N will simply increment the counter, which is an AtomicInteger, guaranteing that we can't have a wrong update of this value. The revision will never change.
At this point, using the head is quite straight-forward.

* Adding an element

So the writter creates a new revision. This will create a new element :

    new      head
   +---+     +---+
   |N+1|     | N |<<--- T1, T2, T5
   +---+     +---+
   | 0 |     | 3 |
   +---+     +---+
   |   |--o  |   |--o
   +---+     +---+
o--|   |  o--|   |
   +---+     +---+
  
The Head should always be pointing to one single cell, and any thread should be able to get it. This is easy to do if we use an AtomicReference :

    new      [head]
   +---+     +---+
   |N+1|     | N |<<--- T1, T2, T5
   +---+     +---+
   | 0 |     | 3 |
   +---+     +---+
   |   |--o  |   |--o
   +---+     +---+
o--|   |  o--|   |
   +---+     +---+
  
Swapping the head is done after we have updated the links, which is done by the writter thread :

Step 1 :

Attach the head to the new element

    new      [head]
   +---+     +---+
   |N+1|     | N |<<--- T1, T2, T5
   +---+     +---+
   | 0 |     | 3 |
   +---+     +---+
   |   |--o  |   |--o
   +---+     +---+
o--|   |<----|   |
   +---+     +---+
  
Step 2 :

Attach the new element to the head (Here, at the same time, a new thread needs the latest revison (still N))


    new      [head]
   +---+     +---+
   |N+1|     | N |<<--- T1, T2, T5, T7
   +---+     +---+
   | 0 |     | 4 |
   +---+     +---+
   |   |---->|   |--o
   +---+     +---+
o--|   |<----|   |
   +---+     +---+

Step 3 :

Swap the head

   [head]
   +---+     +---+
   |N+1|     | N |<<--- T1, T2, T5, T7
   +---+     +---+
   | 0 |     | 4 |
   +---+     +---+
   |   |---->|   |--o
   +---+     +---+
o--|   |<----|   |
   +---+     +---+

Step 4 :

A new thread need the latest revision


           +--- T8
   [head]  |
   +---+   | +---+
   |N+1|<<-+ | N |<<--- T1, T2, T5, T7
   +---+     +---+
   | 1 |     | 4 |
   +---+     +---+
   |   |---->|   |--o
   +---+     +---+
o--|   |<----|   |
   +---+     +---+
 
At this point, no other thread will *ever* be attached to revision N

This can go on an on, with new revisions attached to the list by the writter thread. Now, let's consider two different use case :
1) when the oldest revision has a counter going down to zero
2) when a revision which is not the head nor the oldest that is going to zero

First case, a first approach :

This is the simplest. We have to remove the element from the lists, and check if its parent's counter is also zero and is not the head. If so, we iterate until we meet an element which counter is not zero and is not the head.

In any case, we simply have to change its parent's next pointer, and to remove the element from the list :

           +--- T8   +-- T1, T2, T5, T7
   [head]  |         |
   +---+   | +---+   | +---+     +---+
   |N+1|<<-+ | N |<<-+ |N-1|     |N-2|x---- T9 This thread will release
the element, setting teh counter to 0
   +---+     +---+     +---+     +---+
   | 1 |     | 4 |     | 0 |     | 0 |  Was 1
   +---+     +---+     +---+     +---+
   |   |---->|   |---->|   |---->|   |--o
   +---+     +---+     +---+     +---+
o--|   |<----|   |<----|   |<----|   |
   +---+     +---+     +---+     +---+

The thread that set the counter to 0

Step 1 :

Detach the last element parent's next pointer

           +--- T8   +-- T1, T2, T5, T7
   [head]  |         |
   +---+   | +---+   | +---+     +---+
   |N+1|<<-+ | N |<<-+ |N-1|     |N-2|x---- T9
   +---+     +---+     +---+     +---+
   | 1 |     | 4 |     | 0 |     | 0 |
   +---+     +---+     +---+     +---+
   |   |---->|   |---->|   |--o  |   |--o
   +---+     +---+     +---+     +---+
o--|   |<----|   |<----|   |<----|   |
   +---+     +---+     +---+     +---+

Step 2 :
Iterate if the parent's counter is 0

           +--- T8   +-- T1, T2, T5, T7
   [head]  |         |
   +---+   | +---+   | +---+     +---+
   |N+1|<<-+ | N |<<-+ |N-1|     |N-2|x---- T9
   +---+     +---+     +---+     +---+
   | 1 |     | 4 |     | 0 |     | 0 |
   +---+     +---+     +---+     +---+
   |   |---->|   |--o  |   |--o  |   |--o
   +---+     +---+     +---+     +---+
o--|   |<----|   |<----|   |<----|   |
   +---+     +---+     +---+     +---+

Step 3 :
Detach the last element which counter is 0 prev pointer

           +--- T8   +-- T1, T2, T5, T7
   [head]  |         |
   +---+   | +---+   |     +---+     +---+
   |N+1|<<-+ | N |<<-+     |N-1|     |N-2|x---- T9
   +---+     +---+         +---+     +---+
   | 1 |     | 4 |         | 0 |     | 0 |
   +---+     +---+         +---+     +---+
   |   |---->|   |--o      |   |--o  |   |--o
   +---+     +---+         +---+     +---+
o--|   |<----|   |      o--|   |<----|   |
   +---+     +---+         +---+     +---+
 
Step 4 :

 We can now dispose the unlinked elements.

What if a thread is freeing an element while we are already free another one ? This can be exposed using the previous sample :

           +--- T8   +-- T1, T2, T5, T7
   [head]  |         |
   +---+   | +---+   | +---+     +---+
   |N+1|<<-+ | N |<<-+ |N-1|     |N-2|x---- T9 This thread will release
the element, setting teh counter to 0
   +---+     +---+     +---+     +---+
   | 1 |     | 4 |     | 1 |     | 0 |  Was 1
   +---+     +---+     +---+     +---+
   |   |---->|   |---->|   |---->|   |--o
   +---+     +---+     +---+     +---+
o--|   |<----|   |<----|   |<----|   |
   +---+     +---+     +---+     +---+

Step 0+

A thread (10) release the N-1 revison. We have two possibilities here :

1) The N-2 parent's next is not yet detached


           +--- T8   +-- T1, T2, T5, T7
   [head]  |         |
   +---+   | +---+   | +---+     +---+
   |N+1|<<-+ | N |<<-+ |N-1|     |N-2|x---- T9 This thread will release
the element, setting teh counter to 0
   +---+     +---+     +---+     +---+
   | 1 |     | 4 |     | 0 |     | 0 |  Was 1
   +---+     +---+     +---+     +---+
   |   |---->|   |---->|   |---->|   |--o
   +---+     +---+     +---+     +---+
o--|   |<----|   |<----|   |<----|   |
   +---+     +---+     +---+     +---+

We just continue walking the list up to the first element with a non zero counter

2) The N-2 paren'ts next is already detached



           +--- T8   +-- T1, T2, T5, T7
   [head]  |         |         +-- T10
   +---+   | +---+   | +---+   | +---+
   |N+1|<<-+ | N |<<-+ |N-1|x--+ |N-2|x---- T9 This thread will release the element, setting teh counter to 0
   +---+     +---+     +---+     +---+
   | 1 |     | 4 |     | 0 |     | 0 |  Was 1
   +---+     +---+     +---+     +---+
   |   |---->|   |---->|   |--o  |   |--o
   +---+     +---+     +---+     +---+
o--|   |<----|   |<----|   |<----|   |
   +---+     +---+     +---+     +---+
  
In this case, we will have 2 threads dealing with two elements to be cleaned up. This is not a problem, as T9 will now detach the N-2 element from teh N-1 one and T9 will detach the N-1 element from the list. All is safe.

About the removed elements
--------------------------

Once a thread has detached some elements which counters are equal to 0, what should we do with them ? As we will update the disk with the removed pages, we have to send them to teh Writter thread. This can be done by pushing teminto a deque (and more specifically a
ConcurrentLinkedDeque)

Another option is that the writer thread handle the cleaning of this queue when it is weaken up or periodically. In this case, we don't need to push the element sinto a deque, and the reader threads will just set the counter to 0.

Second case
-----------

What if we accept the removal of elements from the middle of the list ?
This is way more complex, as we have to be sure we don't break the chain. One way to get rid of this constraint is to enforce the writer thread to do the job, by periodically checing the whole list (which shuld not be too costly). In this case, the reader thread never update the element's parent's next pointer, they just decrement teh counter.
When this counter becomes 0, then we can remove the element.

Solution
========

We will separate the threads into 2 categories :
- readers (we may have many)
- writer (we have only one)

The readers always peek the latest revision (which is the head of the list). It then increment the counter. When teh thread does not need the revision anymore, the counter is decremented. This is the only action the reader thread will ever do on the elements and on the list, but it will always keep a reference to the element into the session.

The writer thread has a lot more to do :
- it is responsible for adding element at the top of the list (see upper). Once the head is swapped, every new reader will get the added element as a reference.
- it also has to check in the list for the elements which counter is 0, if teh list is longer than 1 element. This check can be done after each update, or periodically (every N seconds if the number of elements is > 1, or after M additions) if tht help saving updates on disk.
 
Referencing pages
-----------------

All the B-tree revisions are kept into the BoB B-tree, so it's easy to retrieve the rootPage for a given revision.
 
As all the pages can't be hold in memory, we sometime have to reach a page from disk or from the cache. This is done following this simple algorithm :

  for a given page with offset O
    if the cache contains an instance for O
      return the page
    else
      fetch the page with offset O from disk
      store it in the cache
     
In our case, we just add a layer, which will check in the working memory for the instance of a page, given its offset.

Re: ApacheDS with Mavibot anytime soon?

Posted by Emmanuel Lécharny <el...@gmail.com>.

Le 23/09/15 21:41, Carlo.Accorsi@ibs-ag.com a écrit :
> Hi,

hi,

sorry for the delayed response ! I saw your mail 3 days ago, but I
forgot to answer back then, being caught in a storm of delaying events...
>
> We have one customer that wants to get replication implemented. We've set it up in house and it seems to work well.
> The trouble is there are 10's of thousands of users and all sorts of concurrent activity. They're on M16 and a few times per month, we get the 'ERR_554 double get for block' and then we need to restore.
> We have an automated way of doing this now so it's not too disruptive when it happens.
Hopefully. Still I understand this is not convenient.
>
> My question is, would adding replication just add more concurrent activity to the servers and lead more frequently to the ERR_554 situation?
For the consumer, it will be just as if the updates were coming from a
standard user. So, no, I don't think it's a real problem.
>
> Also he's asked to upgrade to M20 but my hunch is that for this issue it won't matter much because we're still on JDBM?

True. It's just that some bugs have been fixed in M20.
>
> Finally, we're eager to test out a release with the Mavibot backed. I'm sure it's possible to build ourselves but we're trying to keep the moving parts to a minimum
> and use only ApacheDS releases.
I understand.

Mavibot work has been delayed, and we are working on it as much as we
can, but it's not enough. Kiran and I have day jobs atm that forbid us
to dedicate as much time as would be necessary to complete the last part
that is mandatory to get this problem fixed : the transaction system. We
restarted to 'think' about the best implementation 2 weeks ago (on your
free time) and we expect to be able to spend more time on it next week,
as we will both be at Budapest, during the Apache Conference.

Transaction  support (the way we need it, ie cross-btree) is not that
simple to impelment. We have a pretty clear idea on how it should work,
but that mpacts the current design, as we need to woirk in memory, and
avoid to copy pages that have already been modifed in the transaction
boundaries. We also have to deal with the cleanup of old versions, which
also means we need to implement the version manager.

To give you a insight on what I'm coming for, here is a draft of my
thoughts about this transaction manager (note that this is my vision,
that I haven't shared yet, because I don't think it's covering all the
moving parts, but still, I think it won't be that different to what we
will end with ) :

Transaction support
-------------------

Mavibot must support transaction across many B-tres (ie, whatever the
number of b-trees we are updating, we should wait before committing the
changes up to the point the transaction is closed). That means we may
have many updates pending in memory. In any case, if a rollback is done,
the modified b-trees will remain intact, in the same state they were
before the transaction started.

To achive this, we need to work in a working memory. As we may have to
fetch pages that are to be updated, we will face three cases :
- the page is not in memory :
we have to fetch them from disk, update them and store it in the working
memory we can read it back if needed
- the page is in the cache :
we have to copy it to the working memory
- the page is in the working memory : we simply update it, leaving it in
the working memory.

So if we have to modify the same page many times, it will be updated in
the working memory, we won't need to copy it.

That actually change the current Mavibot implementation, as it's copying
pages no matter what. Here, if the page is coming from the working
memory, we just update it.


Layout
------

 Disk            Cache           Working
  ___                            Memory
 /   \          .------.       
+     +        /      /|         .------.
|\___/|       .------. |       .------. |
+     +       |      | |       |      | |
|\___/| ----> |      | | ----> |      | |
+     +       |      |/        |      |/
|\___/|       .------.         .------.
+     +
 \___/
 
 When the transaction is rollbacked, we simply delete the content of the
working memory.
 When the transaction is committed, we write all the pages in the
working memory on disk, and into the cache, we update the header, and we
purge the working memory.
 
 If a crash occurs during a transaction, then either the transaction is
rollbacked (if the process is still running) - which is just about
purging the working memory -, or there will be nothing to do if we have
to restart the application, as the modified pages were in memory nad
have been removed while the process crashed.

OTOH, when we read, we must be sure that the B-trees are present, and
that we always use the B-tree revisions that are correlated. This is
critical when using LDAP, where many B-trees are updated during a sinlge
write.

The way to do that is to use a revision, and always use the B-trees
which revision is just below or equal to this revision. Let's see with
an example

 +-----+---+---+---+---+---+---+---+...
 |\ Rev|   |   |   |   |   |   |   |
 | \   |   |   |   |   |   |   |   |
 |  \  | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8
 |B- \ |   |   |   |   |   |   |   |
 |tree\|   |   |   |   |   |   | L | N
 +-----+---+---+---+---+---+---+---+...
 | aaa | X |   |   | X | X | X |   |
 |     | 1 | 1 | 1 | 4 | 5 | 6 | 6 |
 +-----+---+---+---+---+---+---+---+...
 | bbb | X | X | X |   | X | X | X |
 |     | 1 | 2 | 3 | 3 | 5 | 6 | 7 |
 +-----+---+---+---+---+---+---+---+...
 | ccc | X |   | X | X | X |   | X |
 |     | 1 | 1 | 3 | 4 | 5 | 5 | 7 |
 +-----+---+---+---+---+---+---+---+...
 | ddd | X | X |   |   | X |   |   |
 |     | 1 | 2 | 2 | 2 | 5 | 5 | 5 |
 +-----+---+---+---+---+---+---+---+...
 | eee | X |   | X | X |   | X |   |
 |     | 1 | 1 | 3 | 4 | 4 | 6 | 6 |
 +-----+---+---+---+---+---+---+---+...
 
L : Latest revision
N : New revision
 
In this table, we have had 7 updates, and 7 revsions. As all the B-trees
haven't been updated, we may have some 'holes' in the revision
associated to a B-tree. Typically, here, the 'aaa' B-tree will have 4
revisions (1, 4, 5 and 6) created, because this B-tree has been updated
4 times.
 
Doing a read on revision 7 will use the following B-trees : aaa[6],
bbb[7], ccc[7], ddd[5], eee[6].


All in all, a read is associated to a global transaction revision which
is used as a marker. Every B-tree update is done using the new revision.

Managing on-going reads
-----------------------

We maintain all the B-tree revisions associated with a read, until the
read is completed, we need to get rid of the old revision when we are
done. This is critical to keep the database size to a minimum. In order
to do that, we must know if an older revision is still in use. We then
need to keep a common data structure that is shared across threads,
which will hold all the used revisions, ordered. With the previous
example, assuming revisions 2, 4 and 6 are in use, we may use a linked
list for that purpose :

   +---+   +---+   +---+
o--| 2 |<--| 4 |<--| 6 |
   +--2+   +--1+   +--3+
  
When the read using rev 4 is done, we remove it from this list :
          +-------------+
   +---+  |    +---+    | +---+
o--| 2 |<-+  X-| 4 |    +-| 6 |
   +--2+       +--0+      +--3+
                
The two remaining revisions are still in use, and one of them is older.
We can clean the B-tree pages associated with revision 4 if there is
already a new revision for this B-tree, and for the pages that have been
copied.

When the read using rev 2 is done, we can clean all the B-tree pages
that are associated with every revision below 6, as soon as there is a
B-tree revision above the oldest one.

An other important point : each revision might be used more than once.
Releasing a read will decrement the count, and when it goes down to 0,
we can safely remove the element from the list. We need a thread-safe
data structure for that.

Managing the list
-----------------

let's go a bit deeper in the management of this list. It has interesting
characteristics :

- elements are added by a unique thread : the writter
- elements are not removed unless the count becomes 0
- the count is equal to the maximum number of thread that can use the
element at some point
- it's not because the counter is N that N thread *are* using this
element : we may eventually have no thread using it ! (but they *may*
come back later...)
- each element must be be protected agains concurrent update (which is a
different requirement than protecting the full list)
- any thread requiring an element in the list will always use the head
of the least (which is the most recent)
- we don't do any list traversal ! Once a thread uses an element, it
remembers about its position
- no user Thread is allowed to remove an element from this list. The
only thread that can do that is the Cleaner Thread


The last characteristic is interesting : that means we can safely
depends on one single thread to do the cleaning, so there is no
concurrent access on teh list while deleting elements.

The deletion can be done in two ways :
- from the tail of the list, if the element's counter is equal to zero
(and we can iterate)
- from the inner of the list, when the element's counter to be removed
is zero

The first way is simpler, but it holds potentially a lot of element if
the oldest one is there for a long time
The second way implies we can relink elements in the list, ie this is a
double-linked list

Let's assume we go for the first way. In any case, the list will
*always* have at least one element, even if its counter is zero :

   head
   +---+
   | N | revision
   +---+
   | 0 | users counter (AtomicInteger)
   +---+
   |   |--o
   +---+
o--|   |
   +---+
  
A thread wanting to use revision N will simply increment the counter,
which is an AtomicInteger, guaranteing that we can't have a wrong update
of this value. The revision will never change.
At this point, using the head is quite straight-forward.

* Adding an element

So the writter creates a new revision. This will create a new element :

    new      head
   +---+     +---+
   |N+1|     | N |<<--- T1, T2, T5
   +---+     +---+
   | 0 |     | 3 |
   +---+     +---+
   |   |--o  |   |--o
   +---+     +---+
o--|   |  o--|   |
   +---+     +---+
  
The Head should always be pointing to one single cell, and any thread
should be able to get it. This is easy to do if we use an AtomicReference :

    new      [head]
   +---+     +---+
   |N+1|     | N |<<--- T1, T2, T5
   +---+     +---+
   | 0 |     | 3 |
   +---+     +---+
   |   |--o  |   |--o
   +---+     +---+
o--|   |  o--|   |
   +---+     +---+
  
Swapping the head is done after we have updated the links, which is done
by the writter thread :

Step 1 :

Attach the head to the new element

    new      [head]
   +---+     +---+
   |N+1|     | N |<<--- T1, T2, T5
   +---+     +---+
   | 0 |     | 3 |
   +---+     +---+
   |   |--o  |   |--o
   +---+     +---+
o--|   |<----|   |
   +---+     +---+
  
Step 2 :

Attach the new element to the head (Here, at the same time, a new thread
needs the latest revison (still N))


    new      [head]
   +---+     +---+
   |N+1|     | N |<<--- T1, T2, T5, T7
   +---+     +---+
   | 0 |     | 4 |
   +---+     +---+
   |   |---->|   |--o
   +---+     +---+
o--|   |<----|   |
   +---+     +---+

Step 3 :

Swap the head

   [head]
   +---+     +---+
   |N+1|     | N |<<--- T1, T2, T5, T7
   +---+     +---+
   | 0 |     | 4 |
   +---+     +---+
   |   |---->|   |--o
   +---+     +---+
o--|   |<----|   |
   +---+     +---+

Step 4 :

A new thread need the latest revision


           +--- T8
   [head]  |
   +---+   | +---+
   |N+1|<<-+ | N |<<--- T1, T2, T5, T7
   +---+     +---+
   | 1 |     | 4 |
   +---+     +---+
   |   |---->|   |--o
   +---+     +---+
o--|   |<----|   |
   +---+     +---+
 
At this point, no other thread will *ever* be attached to revision N

This can go on an on, with new revisions attached to the list by the
writter thread. Now, let's consider two different use case :
1) when the oldest revision has a counter going down to zero
2) when a revision which is not the head nor the oldest that is going to
zero

First case, a first approach :

This is the simplest. We have to remove the element from the lists, and
check if its parent's counter is also zero and is not the head. If so,
we iterate until we meet an element which counter is not zero and is not
the head.

In any case, we simply have to change its parent's next pointer, and to
remove the element from the list :

           +--- T8   +-- T1, T2, T5, T7
   [head]  |         |
   +---+   | +---+   | +---+     +---+
   |N+1|<<-+ | N |<<-+ |N-1|     |N-2|x---- T9 This thread will release
the element, setting teh counter to 0
   +---+     +---+     +---+     +---+
   | 1 |     | 4 |     | 0 |     | 0 |  Was 1
   +---+     +---+     +---+     +---+
   |   |---->|   |---->|   |---->|   |--o
   +---+     +---+     +---+     +---+
o--|   |<----|   |<----|   |<----|   |
   +---+     +---+     +---+     +---+

The thread that set the counter to 0

Step 1 :

Detach the last element parent's next pointer

           +--- T8   +-- T1, T2, T5, T7
   [head]  |         |
   +---+   | +---+   | +---+     +---+
   |N+1|<<-+ | N |<<-+ |N-1|     |N-2|x---- T9
   +---+     +---+     +---+     +---+
   | 1 |     | 4 |     | 0 |     | 0 |
   +---+     +---+     +---+     +---+
   |   |---->|   |---->|   |--o  |   |--o
   +---+     +---+     +---+     +---+
o--|   |<----|   |<----|   |<----|   |
   +---+     +---+     +---+     +---+

Step 2 :
Iterate if the parent's counter is 0

           +--- T8   +-- T1, T2, T5, T7
   [head]  |         |
   +---+   | +---+   | +---+     +---+
   |N+1|<<-+ | N |<<-+ |N-1|     |N-2|x---- T9
   +---+     +---+     +---+     +---+
   | 1 |     | 4 |     | 0 |     | 0 |
   +---+     +---+     +---+     +---+
   |   |---->|   |--o  |   |--o  |   |--o
   +---+     +---+     +---+     +---+
o--|   |<----|   |<----|   |<----|   |
   +---+     +---+     +---+     +---+

Step 3 :
Detach the last element which counter is 0 prev pointer

           +--- T8   +-- T1, T2, T5, T7
   [head]  |         |
   +---+   | +---+   |     +---+     +---+
   |N+1|<<-+ | N |<<-+     |N-1|     |N-2|x---- T9
   +---+     +---+         +---+     +---+
   | 1 |     | 4 |         | 0 |     | 0 |
   +---+     +---+         +---+     +---+
   |   |---->|   |--o      |   |--o  |   |--o
   +---+     +---+         +---+     +---+
o--|   |<----|   |      o--|   |<----|   |
   +---+     +---+         +---+     +---+
 
Step 4 :

 We can now dispose the unlinked elements.

What if a thread is freeing an element while we are already free another
one ? This can be exposed using the previous sample :

           +--- T8   +-- T1, T2, T5, T7
   [head]  |         |
   +---+   | +---+   | +---+     +---+
   |N+1|<<-+ | N |<<-+ |N-1|     |N-2|x---- T9 This thread will release
the element, setting teh counter to 0
   +---+     +---+     +---+     +---+
   | 1 |     | 4 |     | 1 |     | 0 |  Was 1
   +---+     +---+     +---+     +---+
   |   |---->|   |---->|   |---->|   |--o
   +---+     +---+     +---+     +---+
o--|   |<----|   |<----|   |<----|   |
   +---+     +---+     +---+     +---+

Step 0+

A thread (10) release the N-1 revison. We have two possibilities here :

1) The N-2 parent's next is not yet detached


           +--- T8   +-- T1, T2, T5, T7
   [head]  |         |
   +---+   | +---+   | +---+     +---+
   |N+1|<<-+ | N |<<-+ |N-1|     |N-2|x---- T9 This thread will release
the element, setting teh counter to 0
   +---+     +---+     +---+     +---+
   | 1 |     | 4 |     | 0 |     | 0 |  Was 1
   +---+     +---+     +---+     +---+
   |   |---->|   |---->|   |---->|   |--o
   +---+     +---+     +---+     +---+
o--|   |<----|   |<----|   |<----|   |
   +---+     +---+     +---+     +---+

We just continue walking the list up to the first element with a non
zero counter

2) The N-2 paren'ts next is already detached



           +--- T8   +-- T1, T2, T5, T7
   [head]  |         |         +-- T10
   +---+   | +---+   | +---+   | +---+
   |N+1|<<-+ | N |<<-+ |N-1|x--+ |N-2|x---- T9 This thread will release
the element, setting teh counter to 0
   +---+     +---+     +---+     +---+
   | 1 |     | 4 |     | 0 |     | 0 |  Was 1
   +---+     +---+     +---+     +---+
   |   |---->|   |---->|   |--o  |   |--o
   +---+     +---+     +---+     +---+
o--|   |<----|   |<----|   |<----|   |
   +---+     +---+     +---+     +---+
  
In this case, we will have 2 threads dealing with two elements to be
cleaned up. This is not a problem, as T9 will now detach the N-2 element
from teh N-1 one and T9 will detach the N-1 element from the list. All
is safe.

About the removed elements
--------------------------

Once a thread has detached some elements which counters are equal to 0,
what should we do with them ? As we will update the disk with the
removed pages, we have to send them to teh Writter thread. This can be
done by pushing teminto a deque (and more specifically a
ConcurrentLinkedDeque)

Another option is that the writer thread handle the cleaning of this
queue when it is weaken up or periodically. In this case, we don't need
to push the element sinto a deque, and the reader threads will just set
the counter to 0.

Second case
-----------

What if we accept the removal of elements from the middle of the list ?
This is way more complex, as we have to be sure we don't break the
chain. One way to get rid of this constraint is to enforce the writer
thread to do the job, by periodically checing the whole list (which
shuld not be too costly). In this case, the reader thread never update
the element's parent's next pointer, they just decrement teh counter.
When this counter becomes 0, then we can remove the element.

Solution
========

We will separate the threads into 2 categories :
- readers (we may have many)
- writer (we have only one)

The readers always peek the latest revision (which is the head of the
list). It then increment the counter. When teh thread does not need the
revision anymore, the counter is decremented. This is the only action
the reader thread will ever do on the elements and on the list, but it
will always keep a reference to the element into the session.

The writer thread has a lot more to do :
- it is responsible for adding element at the top of the list (see
upper). Once the head is swapped, every new reader will get the added
element as a reference.
- it also has to check in the list for the elements which counter is 0,
if teh list is longer than 1 element. This check can be done after each
update, or periodically (every N seconds if the number of elements is >
1, or after M additions) if tht help saving updates on disk.
 
Referencing pages
-----------------

All the B-tree revisions are kept into the BoB B-tree, so it's easy to
retrieve the rootPage for a given revision.
 
As all the pages can't be hold in memory, we sometime have to reach a
page from disk or from the cache. This is done following this simple
algorithm :

  for a given page with offset O
    if the cache contains an instance for O
      return the page
    else
      fetch the page with offset O from disk
      store it in the cache
     
In our case, we just add a layer, which will check in the working memory
for the instance of a page, given its offset.