Posted to dev@directory.apache.org by Emmanuel Lecharny <el...@gmail.com> on 2010/05/27 17:40:35 UTC

About operation atomicity

Hi guys,

as we already know, we may have issues in the way we handle modify 
operations in the server (add, modify, delete, moddn). Basically, we 
just synchronized the following methods in the AbstractStore class:
- add
- delete
- modify
- move
- rename

That's good enough to protect the server against concurrent 
modifications, but the problem is that behind the curtain, they are all 
being processed by a backend implementation, which can be Jdbm, HBase, 
Avl, Oracle, ... All those guys are dealing with concurrent access in 
their own way.

Let's focus on Jdbm atm, as it's our main backend. Every operation 
accesses many indexes and the master table. Each index is a pair of 
JdbmTables (forward and backward), and the Master table itself is also a 
JdbmTable. A JdbmTable itself is backed by a BTree.
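
To make this concrete, here is a minimal sketch of what sits at the 
bottom of that stack, using the plain JDBM 1.0 API directly. This is not 
the JdbmTable code itself, and the file name is just illustrative:

import jdbm.RecordManager;
import jdbm.RecordManagerFactory;
import jdbm.btree.BTree;
import jdbm.helper.StringComparator;

public class BTreeDemo
{
    public static void main( String[] args ) throws Exception
    {
        // Every JdbmTable (the forward/backward tables of an index, and
        // the master table) ultimately delegates to a jdbm.btree.BTree.
        RecordManager recMan = RecordManagerFactory.createRecordManager( "example" );
        BTree tree = BTree.createInstance( recMan, new StringComparator() );

        tree.insert( "apache", "some value", true ); // a write: synchronized in BTree
        Object value = tree.find( "apache" );        // a read: also synchronized

        recMan.commit();
        recMan.close();
    }
}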

So far, so good.

It becomes a bit more complex when it comes to protecting all this 
against concurrent access:
- the upper layer (AbstractStore) does not synchronize the read operations
- nor does the JdbmIndex (add, drop(K,Long), sync and close are 
synchronized; lookups and, more surprisingly, drop(K) aren't)
- nor does JdbmTable (the close, put, remove and sync methods are 
synchronized; get(K) isn't)
- but all the BTree methods are synchronized, including the ones that 
read data.

That means that, whatever we do, we will have a bottleneck on read 
operations, and we also have overzealous synchronization, as every layer 
protects all the modify operations. (This is not really an issue, though: 
we don't care if we synchronize more than once, since we are already 
guaranteed to be thread safe once we get into the AbstractStore.)

Now, the question is how can we release this contention on reads? 
Obviously, we have to protect the way we modify data against concurrent 
modifications, otherwise the indexes would quickly be broken, but we want 
to allow searches to be done in parallel. That means we must remove all 
the synchronization on reads from the BTree implementation.

If we do that, how do we guarantee that a search based on an index will 
return the correct data, instead of getting an exception because the 
index is being reorganized by another thread?

Considering we keep the current JDBM implementation, I see two ways to 
do that:
- implementing a kind of barrier that lets all reads proceed without 
synchronization, blocking a write when it arrives until all the pending 
reads are done; all the following operations then have to wait until the 
write is done. If the next operations are reads, then once the write is 
done, they can proceed in parallel again (sketched after scenario 1 below).
- using an optimistic lock system, assuming that a read is successful if 
no write occurred between the time the read started and the time it 
finished; if not, we redo the read. This can be done using a global 
marker which is incremented by every write operation. If we get an 
exception during a search, it means we tried to access some data that 
was temporarily unavailable because a write was modifying the data 
structure, so we redo the read (sketched after scenario 2 below).

The second strategy is not 100% safe, as we have no guarantee in case we 
start a read just after a write has started, and finish the read before 
the write has finished. Let's see what this means on a few timelines:

scenario 1:
-----------

Thread 1 starts a read at t0
Thread 2 starts a read at t1
Thread 3 starts a write at t3: it blocks until the reads of Threads 1 
and 2 are done

From t3 on, all new read or write operations are simply blocked, waiting 
for Thread 3 to be done.

Thread 4 starts a read at t4: it blocks, waiting for Thread 3 to be done
Thread 5 starts a read at t5: it blocks, waiting for Thread 3 to be done
Thread 1 finishes at t6. Nothing changes.
Thread 6 starts a write at t7: it blocks, waiting for Thread 3 to be done
Thread 2 finishes at t8: Thread 3 is now unblocked, and can proceed. 
Threads 4, 5 and 6 are still blocked, waiting for Thread 3 to be done
Thread 3 finishes at t9: we can now process the pending threads: Threads 
4, 5 and 6
Threads 4 and 5 are reads, so they are executed concurrently
Thread 6 is a write, so it has to wait for both Threads 4 and 5 to be done
etc...

Here, reads are executed concurrently without synchronization, but they 
block the writes; when a write comes in, everything is blocked until the 
write is done. (A sketch of such a barrier follows.)
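
Here is a minimal sketch of such a barrier, built on a fair 
ReentrantReadWriteLock from java.util.concurrent. The OperationBarrier 
name and the method signatures are mine, just for illustration:

import java.util.concurrent.Callable;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class OperationBarrier
{
    // Fair mode gives the FIFO behavior described above: once a write
    // is queued, the reads arriving after it wait for it to be done.
    private final ReadWriteLock lock = new ReentrantReadWriteLock( true );

    // Wraps a search: any number of readers may enter concurrently.
    public <T> T read( Callable<T> search ) throws Exception
    {
        lock.readLock().lock();

        try
        {
            return search.call();
        }
        finally
        {
            lock.readLock().unlock();
        }
    }

    // Wraps an add/delete/modify/move/rename: exclusive access.
    public void write( Runnable modification )
    {
        lock.writeLock().lock();

        try
        {
            modification.run();
        }
        finally
        {
            lock.writeLock().unlock();
        }
    }
}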

Scenario 2:
------------
when a write starts, it increments a global TS (timestamp), an 
AtomicLong for instance

case 1 :

      W start   W ends
      |         |
      V         V
-----o.........o----
   ^         ^
   |         |
   R starts  R ends

in this case, the read grabs the TS when it starts, and when it ends, 
the TS it grabs again will be different, as a write has started in 
between. We assume the data we got may not be valid anymore, and we redo 
the read.

case 2 :

      W start   W ends
      |         |
      V         V
-----o.........o----
   ^              ^
   |              |
   R starts       R ends

same as case 1

case 3 :

      W start   W ends
      |         |
      V         V
-----o.........o----
         ^         ^
         |         |
         R starts  R ends

Here, the TS is equal at both ends, so we may get an exception, or an 
invalid result. If we get an exception, we can simply restart the read. 
We have no way to know whether the result is valid, so we assume it is, 
because we don't care if it's not (once the user has got a result we are 
safe, as LDAP does dirty reads anyway).

case 4 :


      W start         W ends
      |               |
      V               V
-----o...............o----
         ^         ^
         |         |
         R starts  R ends

Same as case 3.
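
Here is a minimal sketch of this optimistic scheme, assuming a single 
global AtomicLong that every write bumps before it starts touching the 
tables (all names are mine, just for illustration). Note that in cases 3 
and 4 the counter is equal at both ends, so the overlap goes undetected: 
that is exactly the dirty read we accept.

import java.util.concurrent.Callable;
import java.util.concurrent.atomic.AtomicLong;

public class OptimisticReader
{
    // Incremented by every write operation before it modifies the tables.
    private final AtomicLong writeCounter = new AtomicLong();

    public void beginWrite()
    {
        writeCounter.incrementAndGet();
    }

    // Retries the search until no write started while it was running.
    // A real implementation would probably cap the number of retries.
    public <T> T read( Callable<T> search )
    {
        while ( true )
        {
            long before = writeCounter.get();

            try
            {
                T result = search.call();

                if ( writeCounter.get() == before )
                {
                    // Equal counters: either no write at all, or a write
                    // that started before we did (cases 3 and 4). We
                    // accept the result as a dirty read.
                    return result;
                }
                // A write started during the read (cases 1 and 2): redo it.
            }
            catch ( Exception e )
            {
                // The structure was being reorganized under us: redo the read.
            }
        }
    }
}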

--ends of scenarios--

So we have a scenario 1 where we are totally thread safe, but where 
reads can be blocked by a write. This is 100% safe, with less contention 
than what we have right now.

Scenario 2 is a bit less safe, but we don't block reads: we just redo 
them if we get an exception.

Any better solution implies we implement a better underlying system, 
based on MVCC for instance.

We also have to keep in mind that writes impact more than one table, and 
they should be atomic, so we must synchronize writes anyway.

Now, thanks for having read this long brain dump. I'd like to know your 
opinion about those two strategies: which one should we implement, 
assuming I'm not off base and you don't have a better solution to 
propose?






-- 
Regards,
Cordialement,
Emmanuel Lécharny
www.nextury.com



Re: About operation atomicity

Posted by Stefan Seelmann <se...@apache.org>.
Alex Karasulu wrote:
>>> One question I have is whether or not we can now get rid of the Store
>>> layer. Am thinking this might not be needed now and it makes the
>>> picture more complex when evaluating these concerns.  Can we re-factor
>>> it out of the picture? Thoughts on this?
>> Well, the Store interface contains some comments about the reason and
>> the history why a separate Store interface has been created. If that
>> isn't valid anymore I think we should merge the partition and store
>> layer.
>>
>> Kind Regards,
>> Stefan
> 
> Problem is I don't know if changes in the code still mandate the reasons
> given in this module.

The Partition interface is defined in the core-api module. xdbm-search 
and core are independent; both only depend on core-api. I think it makes 
sense to keep them independent from each other.

I think it still makes sense to keep the generic Partition interface.
AFAIK the Oracle partition uses it because Oracle has enough search
capabilities to conduct the searches on its own. Another use case is a
proxy partition that just redirects all operations to another LDAP server.

In contrast, we should have the Xdbm-Partition. It defines the well-known
index, table, and master table structure and provides the search engine.
On top of this we have Jdbm-Partition, Avl-Partition, HBase-Partition, etc.

So I think we can do the following:
- merge core-avl into core-api
- merge jdbm-store and jdbm-partition
- merge avl-partition into xdbm-search
- rename xdbm-search into xdbm-partition

I can check that next weekend and report the results.

Kind Regards,
Stefan



Re: About operation atomicity

Posted by Alex Karasulu <ak...@gmail.com>.

Sent from my iPhone

On May 29, 2010, at 2:41 PM, Stefan Seelmann <se...@apache.org>  
wrote:

> Alex Karasulu wrote:
>> One question I have is whether or not we can now get rid of the Store
>> layer. Am thinking this might not be needed now and it makes the
>> picture more complex when evaluating these concerns. Can we re-factor
>> it out of the picture? Thoughts on this?
> Well, the Store interface contains some comments about the reason and
> the history why a separate Store interface has been created. If that
> isn't valid anymore I think we should merge the partition and store  
> layer.
>
> Kind Regards,
> Stefan

Problem is I don't know if changes in the code still mandate the 
reasons given in this module.

--Alex

Re: About operation atomicity

Posted by Stefan Seelmann <se...@apache.org>.
Alex Karasulu wrote:
> One question I have is whether or not we can now get rid of the Store
> layer. Am thinking this might not be needed now and it makes the
> picture more complex when evaluating these concerns.  Can we re-factor
> it out of the picture? Thoughts on this?
Well, the Store interface contains some comments about the reason and
the history why a separate Store interface has been created. If that
isn't valid anymore I think we should merge the partition and store layer.

Kind Regards,
Stefan


Re: About operation atomicity

Posted by Alex Karasulu <ak...@apache.org>.
On Thu, May 27, 2010 at 6:40 PM, Emmanuel Lecharny <el...@gmail.com>wrote:

> Hi guys,
>
> as we already know, we may have issues in the way we handle modify
> operations in the server (add, modify, delete, moddn). Basically, we just
> synchronized the following methods in the AbstractStore class:
> - add
> - delete
> - modify
> - move
> - rename
>
> That's good enough to protect the server against concurrent modifications,
> but the problem is that behind the curtain, they are all being processed by
> a backend implementation, which can be Jdbm, HBase, Avl, Oracle, ... All
> those guys are dealing with concurrent access in their own way.
>
>
Aye, additional synchronization occurs there too.


> Let's focus on Jdbm atm, as it's our main backend. Every operation accesses
> many indexes and the master table. Each index is a pair of JdbmTables
> (forward and backward), and the Master table itself is also a JdbmTable. A
> JdbmTable itself is backed by a BTree.
>
> So far, so good.
>
>
Aye.


> It becomes a bit more complex when it comes to protecting all this against
> concurrent access:
> - the upper layer (AbstractStore) does not synchronize the read operations
> - nor does the JdbmIndex (add, drop(K,Long), sync and close are
> synchronized; lookups and, more surprisingly, drop(K) aren't)
> - nor does JdbmTable (the close, put, remove and sync methods are
> synchronized; get(K) isn't)
> - but all the BTree methods are synchronized, including the ones that read
> data.
>
> That means that, whatever we do, we will have a bottleneck on read
> operations, and we also have overzealous synchronization, as every layer
> protects all the modify operations. (This is not really an issue, though: we
> don't care if we synchronize more than once, since we are already guaranteed
> to be thread safe once we get into the AbstractStore.)
>
>
One question I have is whether or not we can now get rid of the Store layer.
Am thinking this might not be needed now and it makes the picture more
complex when evaluating these concerns.  Can we re-factor it out of the
picture? Thoughts on this?


> Now, the question is how can we release this contention on reads?
> Obviously, we have to protect the way we modify data against concurrent
> modifications, otherwise the indexes would quickly be broken, but we want to
> allow searches to be done in parallel. That means we must remove all the
> synchronization on reads from the BTree implementation.
>
> If we do that, how do we guarantee that a search based on an index will
> return the correct data, instead of getting an exception because the index
> is being reorganized by another thread?
>
>
MVCC would protect us from this. MVCC is like working on a copy under
concurrency, as you already know. It seems, though, that we're dropping the
MVCC approach due to how complicated it would be in JDBM.


> Considering we keep the current JDBM implementation, I see two ways to do
> that:
> - implementing a kind of barrier that lets all reads proceed without
> synchronization, blocking a write when it arrives until all the pending
> reads are done; all the following operations then have to wait until the
> write is done. If the next operations are reads, then once the write is
> done, they can proceed in parallel again.
>

True, this will work, but you will still not have full isolation even though
you'll remove the chance of dirty reads. Dirty reads go away in that you
will not see the partial impact of write-based operations. However, if a read
arrives before a write-based operation completes, it will see the final
state of the write even though it should not, since it came in before the
write completed.


> - using an optimistic lock system, assuming that a read is successful if no
> write occurred between the time the read started and the time it finished;
> if not, we redo the read. This can be done using a global marker which is
> incremented by every write operation. If we get an exception during a
> search, it means we tried to access some data that was temporarily
> unavailable because a write was modifying the data structure, so we redo
> the read.
>
>
I think this will yield similar ACID behavior to the MVCC approach;
however, we'll have read exceptions that require retries on reads. MVCC
would not require retries on reads.


> The second strategy is not 100% safe, as we have no guarantee in case we
> start a read just after a write has started, and finish the read before the
> write has finished. Let's see what it means on a time frame description :
>
> scenario 1:
> -----------
>
> Thread 1 starts a read at t0
> Thread 2 starts a read at t1
> Thread 3 starts a write at t3: it blocks until the reads of Threads 1 and 2
> are done
>
> From t3 on, all new read or write operations are simply blocked, waiting
> for Thread 3 to be done.
>
> Thread 4 starts a read at t4: it blocks, waiting for Thread 3 to be done
> Thread 5 starts a read at t5: it blocks, waiting for Thread 3 to be done
> Thread 1 finishes at t6. Nothing changes.
> Thread 6 starts a write at t7: it blocks, waiting for Thread 3 to be done
> Thread 2 finishes at t8: Thread 3 is now unblocked, and can proceed.
> Threads 4, 5 and 6 are still blocked, waiting for Thread 3 to be done
> Thread 3 finishes at t9: we can now process the pending threads: Threads 4,
> 5 and 6
> Threads 4 and 5 are reads, so they are executed concurrently
> Thread 6 is a write, so it has to wait for both Threads 4 and 5 to be done
> etc...
>
> Here, reads are executed concurrently without synchronization, but they
> block the writes; when a write comes in, everything is blocked until the
> write is done.
>

I've attached a hand-written graphic as a PDF for this scenario. Let me
know if it is correct. More to come on scenario 2 - I need to run and take
care of a few things first, but will continue.


>
> Scenario 2 :
> ------------
> when a write starts, it increments a global TS (timestamp), an AtomicLong
> for instance
>
> case 1 :
>
>     W start   W ends
>     |         |
>     V         V
> -----o.........o----
>  ^         ^
>  |         |
>  R starts  R ends
>
> in this case, the read grabs the TS when it starts, and when it ends, the
> TS it grabs again will be different, as a write has started in between. We
> assume the data we got may not be valid anymore, and we redo the read.
>
> case 2 :
>
>     W start   W ends
>     |         |
>     V         V
> -----o.........o----
>  ^              ^
>  |              |
>  R starts       R ends
>
> same as case 1
>
> case 3 :
>
>     W start   W ends
>     |         |
>     V         V
> -----o.........o----
>        ^         ^
>        |         |
>        R starts  R ends
>
> Here, the TS is equal at both ends, so we may get an exception, or an
> invalid result. If we get an exception, we can simply restart the read. We
> have no way to know whether the result is valid, so we assume it is, because
> we don't care if it's not (once the user has got a result we are safe, as
> LDAP does dirty reads anyway).
>
> case 4 :
>
>
>     W start         W ends
>     |               |
>     V               V
> -----o...............o----
>        ^         ^
>        |         |
>        R starts  R ends
>
> Same as case 3.
>
> --ends of scenarios--
>
> So we have a scenario 1 where we are totally thread safe, but where reads
> can be blocked by a write. This is 100% safe, with less contention than what
> we have right now.
>
> Scenario 2 is a bit less safe, but we don't block reads : we just redo them
> if we get an exception.
>
> Any better solution implies we implement a better underlying system, based
> on MVCC for instance.
>
> We also have to keep in mind that writes impact more than one table, and
> they should be atomic, so we must synchronize writes anyway.
>
> Now, thanks for having read this long brain dump, but I'd like to know
> what's your opinion about those two strategies, which one should we
> implement, assuming I'm not off base or that you don't have a better
> solution to propose ?
>
>
>
>
>
>
> --
> Regards,
> Cordialement,
> Emmanuel Lécharny
> www.nextury.com
>
>
>


-- 
Alex Karasulu
My Blog :: http://www.jroller.com/akarasulu/
Apache Directory Server :: http://directory.apache.org
Apache MINA :: http://mina.apache.org
To set up a meeting with me: http://tungle.me/AlexKarasulu

Re: About operation atomicity

Posted by Emmanuel Lecharny <el...@gmail.com>.
On 5/29/10 5:15 PM, Stefan Seelmann wrote:
> Hi,
>
> this discussion makes me think that we should build MVCC directly into
> XDBM. I think it should work independent of the underlying store (JDBM,
> AVL, HBase).
>
> Let me outline my idea:
>
> Instead of MasterTable<ID, Entry>  we use a MasterTable<ID,
> SortedMap<Long, Entry>>. It stores multiple versions of the entry in a
> sorted map, the key of the sorted map is the version number. Very same
> for index tables.
>
> We need some global version information which contains
> - a version counter (sequence or timestamp)
> - the latest valid version
> - list of writes in progress, and
> - list of failed writes
>
> At the beginning of a write or read operation we get a snapshot of this
> global version info. This snapshot is used to get a consistent view of the
> data for the whole operation. For each piece of data read, we use the
> newest version that is less than or equal to the "latest" version,
> excluding versions contained in the "in progress" or "failed" lists. For a
> search operation this snapshot can also be used by the cursor, so that
> while fetching all results there would always be a consistent view of the
> data.
>
> A write operation works as follows:
> - Acquire the next version number; the version number is added to the
> list of "writes in progress" (begin transaction).
> - All micro-writes to the index and master tables are performed with this
> version.
> - All write operations add data; deletes/drops result in the addition of
> an empty value <Version, null> (append-only).
> - On commit the version number is removed from the "writes in progress"
> list and the latest valid version is set (unless it is already higher).
> - If the write fails then the version number is moved to the "failed
> writes" list.
>
> At some point we have to do some garbage collection and delete old
> versions. If the version is a timestamp this can be done by the next
> write by deleting versions older than X days.
>
> I think this way we can also support RFC 5805 (LDAP Transactions).
>
> A big disadvantage may be performance: we always have to read
> and write all the versions from/to the underlying store.
>
> Thoughts?
>    
I'm afraid you will have an issue updating the Map when it contains 
many versions, as you may have more than one thread accessing this Map.

Doing so, you are just moving the problem from one place to another.

I'm afraid that MVCC should be implemented at the lowest level...

PS: I'm currently looking at hawtDB 
(http://fusesource.com/forge/sites/hawtdb/)

-- 
Regards,
Cordialement,
Emmanuel Lécharny
www.nextury.com



Re: About operation atomicity

Posted by Stefan Seelmann <se...@apache.org>.
Hi,

this discussion makes me think that we should build MVCC directly into
XDBM. I think it should work independently of the underlying store (JDBM,
AVL, HBase).

Let me outline my idea:

Instead of MasterTable<ID, Entry> we use a MasterTable<ID,
SortedMap<Long, Entry>>. It stores multiple versions of the entry in a
sorted map; the key of the sorted map is the version number. The very
same applies to the index tables.

We need some global version information which contains
- a version counter (sequence or timestamp)
- the latest valid version
- list of writes in progress, and
- list of failed writes

At the beginning of a write or read operation we get a snapshot of this
global version info. This snapshot is used to get a consistent view of
the data for the whole operation. For each piece of data read, we use the
newest version that is less than or equal to the "latest" version,
excluding versions contained in the "in progress" or "failed" lists. For
a search operation this snapshot can also be used by the cursor, so that
while fetching all results there would always be a consistent view of
the data.

A write operation works as follows (a rough sketch follows this list):
- Acquire the next version number; the version number is added to the
list of "writes in progress" (begin transaction).
- All micro-writes to the index and master tables are performed with this
version.
- All write operations add data; deletes/drops result in the addition of
an empty value <Version, null> (append-only).
- On commit the version number is removed from the "writes in progress"
list and the latest valid version is set (unless it is already higher).
- If the write fails then the version number is moved to the "failed
writes" list.

At some point we have to do some garbage collection and delete old
versions. If the version is a timestamp this can be done by the next
write by deleting versions older than X days.

I think this way we can also support RFC 5805 (LDAP Transactions).

A big disadvantage may be performance: we always have to read and write
all the versions from/to the underlying store.

Thoughts?

Kind Regards,
Stefan