Posted to dev@directory.apache.org by Selcuk AYA <ay...@gmail.com> on 2011/10/06 22:45:06 UTC

summary of txn discussion with Emmanuel

Hi All,
we had a private conversation with Emmanuel on alternatives for
implementing a transactional system, and here is a summary of it. Feel
free to comment and let us know what you think:

****** We basically discussed two ways to implement transactions with
MVCC consistency guarantees for the readers (I guess we all agree on
MVCC):
*Keep previous versions of entries and index entries in the partition.
This would require us to make each partition version aware. Note that
the versioning is at the logical level. Let's say we have two versions
of an entry, where entry1 satisfies reads at versions 5-8 and entry2
satisfies versions 8-infinity, and we have a reader at version 7 that
does a master table lookup for this entry. Then we either need to
fetch all versions of the entry with a single lookup and find the one
that matches our version, or follow a backward chain from the most
recent version to find the one we want. The second option is what
PostgreSQL does.
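
To make the backward chain option concrete, here is a minimal sketch
in Python. The names (EntryVersion, read_at) are hypothetical and not
taken from ApacheDS, and I am assuming half-open version ranges, so
that entry1 covers versions 5, 6 and 7 and entry2 covers 8 onwards:

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class EntryVersion:
    """One logical version of an entry (hypothetical structure)."""
    value: str
    start: int                             # first version that sees this copy
    end: float = math.inf                  # first version that no longer sees it
    prev: Optional["EntryVersion"] = None  # link to the next older copy


def read_at(newest: Optional[EntryVersion], reader_version: int) -> Optional[str]:
    """Walk backward from the newest copy until one covers reader_version."""
    node = newest
    while node is not None:
        if node.start <= reader_version < node.end:
            return node.value
        node = node.prev
    return None  # the entry did not exist at that version


# The example from the text: entry1 covers [5, 8), entry2 covers [8, infinity).
entry1 = EntryVersion("entry1", start=5, end=8)
entry2 = EntryVersion("entry2", start=8, prev=entry1)
```

With this, read_at(entry2, 7) walks one step back and returns
"entry1". The cost of a read grows with the number of versions between
the newest copy and the one the reader needs.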

Also note that since the versioning is on a logical basis, cleaning up
data by reusing entries or index values is not so easy. So we would
probably need a garbage collector. Getting the performance of the
garbage collector right could be a problem.

In order to avoid a garbage collector and the high cost of finding
previous versions of entries and index values, we could keep the
previous versions in memory for as long as a reader needs them.
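
A rough sketch of what keeping old versions in memory only while a
reader needs them could look like. VersionedStore and all of its
methods are made up for illustration. It is single threaded and
ignores index entries:

```python
from collections import Counter


class VersionedStore:
    """Hypothetical store: latest copies plus in-memory superseded copies."""

    def __init__(self):
        self.current = {}         # key -> (start_version, value)
        self.old = {}             # key -> list of (start, end, value)
        self.version = 0          # last committed version
        self.readers = Counter()  # snapshot version -> active reader count

    def begin_read(self):
        self.readers[self.version] += 1
        return self.version

    def end_read(self, snapshot):
        self.readers[snapshot] -= 1
        if self.readers[snapshot] == 0:
            del self.readers[snapshot]
        self._prune()

    def write(self, key, value):
        self.version += 1
        if key in self.current:
            start, old_value = self.current[key]
            # Keep the superseded copy, visible to versions [start, self.version).
            self.old.setdefault(key, []).append((start, self.version, old_value))
        self.current[key] = (self.version, value)
        self._prune()

    def read(self, key, snapshot):
        if key in self.current and self.current[key][0] <= snapshot:
            return self.current[key][1]
        for start, end, value in self.old.get(key, []):
            if start <= snapshot < end:
                return value
        return None

    def _prune(self):
        # A superseded copy is needed only if some reader is older than its end.
        oldest = min(self.readers, default=self.version)
        for key in list(self.old):
            kept = [v for v in self.old[key] if v[1] > oldest]
            if kept:
                self.old[key] = kept
            else:
                del self.old[key]
```

Old copies are dropped as soon as the oldest active reader has moved
past them, so no separate garbage collection pass over the partition
is needed. The price is memory use that grows with long running
readers.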

My take on this is that doing versioning at the physical page level
with hierarchical data structures like a B-tree works well because, as
long as you keep the pointer to the old version of the root, you can
find the previous versions of the pages easily, and you can reuse the
pages when they are no longer needed (or compact the file after a
while). But with versioning at the logical level, you probably need a
garbage collector, and finding the version needed is costly if you do
not keep previous versions in memory.
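
To illustrate the point about old roots, here is a small path copying
sketch. To keep it short it uses a plain binary search tree instead of
a real B-tree with pages, so it only shows the idea: a write copies
the nodes on the path to the key and returns a new root, and anyone
holding the old root still sees the old version. Node, insert and
lookup are illustrative names:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Node:
    key: str
    value: str
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def insert(root: Optional[Node], key: str, value: str) -> Node:
    """Return a new root; only nodes on the path to key are copied."""
    if root is None:
        return Node(key, value)
    if key < root.key:
        return Node(root.key, root.value, insert(root.left, key, value), root.right)
    if key > root.key:
        return Node(root.key, root.value, root.left, insert(root.right, key, value))
    return Node(key, value, root.left, root.right)  # replace the value in a copy


def lookup(root: Optional[Node], key: str) -> Optional[str]:
    while root is not None:
        if key == root.key:
            return root.value
        root = root.left if key < root.key else root.right
    return None


# v1 and v2 are two versions of the same tree, identified by their roots.
v1 = insert(insert(insert(None, "m", "1"), "c", "1"), "t", "1")
v2 = insert(v1, "c", "2")
```

A reader that holds v1 still gets "1" for key "c" while new readers
on v2 get "2", and the untouched subtree (v1.right) is shared between
both versions. Once no reader holds v1, the nodes reachable only from
v1 can be reused.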

*Using a write ahead log for MVCC. This has the overhead of managing
the write ahead log. However, MVCC consistency and transaction
recovery are not the only things a write ahead log could be needed
for. If we do not have a write ahead log, then a separate logging
solution is needed for other problems. For example, at ApacheDS we
have:
          -change log
          -journal
          -consumer log for replication

If we have a write ahead log that is aware of transactions across
partitions, then it can be useful for the above problems and you can
also:
     - make replication transaction aware. This is especially useful if
transactions are not just single LDAP modification requests but stored
procedures or triggers.
      - roll back the whole server to a consistent point in time using the log.
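
As a sketch of the two points above, assume a log whose records carry
a sequence number, a transaction id and the partition they touch. The
record layout and the replay function are my assumptions, not an
existing ApacheDS format. Replaying only committed transactions up to
a chosen log position gives a consistent state across partitions:

```python
def replay(log, up_to_lsn=None):
    """Rebuild partition state from the log, applying only committed txns.

    A txn counts as committed if its commit record is at or before up_to_lsn
    (or anywhere in the log when up_to_lsn is None).
    """
    committed = {
        r["txn"] for r in log
        if r["type"] == "commit" and (up_to_lsn is None or r["lsn"] <= up_to_lsn)
    }
    partitions = {}
    for r in log:
        if r["type"] == "change" and r["txn"] in committed:
            partitions.setdefault(r["partition"], {})[r["dn"]] = r["value"]
    return partitions


# txn 1 spans two partitions, txn 2 commits later, txn 3 never commits.
log = [
    {"lsn": 1, "type": "change", "txn": 1, "partition": "system", "dn": "ou=a", "value": "v1"},
    {"lsn": 2, "type": "change", "txn": 1, "partition": "example", "dn": "ou=b", "value": "v1"},
    {"lsn": 3, "type": "commit", "txn": 1},
    {"lsn": 4, "type": "change", "txn": 2, "partition": "example", "dn": "ou=b", "value": "v2"},
    {"lsn": 5, "type": "commit", "txn": 2},
    {"lsn": 6, "type": "change", "txn": 3, "partition": "system", "dn": "ou=a", "value": "v3"},
]
```

Here replay(log, up_to_lsn=3) gives the state right after txn 1, with
both partitions updated together, and replay(log) ignores the change
of the uncommitted txn 3. A replication consumer could read the same
log and ship whole transactions instead of single changes.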

**** We also talked about having an MVCC backend. If this way is
chosen, then we will probably have txns per partition. This is what
the LDAP servers that I am aware of do. However, with this, it would be
difficult to have cross partition modifications, and we would need to
do additional work for the existing JDBM partition and the upcoming
HBase partition. Also, we would still need logical logging to implement
some of the things mentioned above.

please feel free to comment,

regards,
Selcuk