Posted to dev@directory.apache.org by Selcuk AYA <ay...@gmail.com> on 2011/08/29 10:52:00 UTC

status update on jdbm work(was Re: JDBM experimental branch created)

Hi, I just attached my latest changes for the jdbm branch and wanted to
give a status update and some technical details:

1) Summary
*We now have a jdbm tree which treats find, insert, remove and browse
as actions that execute in isolation. In particular, read actions will
not be affected by ongoing structural changes to the tree and will
only see data changes that completed before they started.

*We allow one writer and multiple readers to execute concurrently.
Synchronized operations are mostly removed.

* Existing tests (except StoredProcedureIT) and the unit tests I
added for the versioned tree pass (I ran mvn clean install
-Dintegration). I think the problem with StoredProcedureIT is an
existing one. There is a piece of code where I serialize and
deserialize tuple values stored in the JDBM btree in order to do a deep
copy. With StoredProcedureIT, deserialization of the hello world stored
procedure throws a UTFDataFormatException. On a clean branch, I added
similar code to deserialize B+ tree page values right after they are
serialized, and I hit the same issue. So I think this is an existing
issue with stored procedure serialization/deserialization.

2) Changes above JDBM level
* I added changes to call the newly added browser->close() interface
when cursors are closed or a cursor is repositioned.
* I hit some existing issues where cursors are not closed. In
particular, I had to change SubentryInterceptor.java to close the
cursor after search operations and change the JDBM container cursor to
close the contained cursor when it is closed. If required, I can
provide these changes as separate fixes.

3) Technical details at JDBM level:

*The core functionality is in LRUCache.java, which implements a
concurrent, versioned cache. There is a power-of-two number of hash
buckets, with one lock covering every 8 buckets (lock striping). The
number of hash buckets is x, where x is the largest power of two such
that x < max number of cache entries.
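
A rough sketch of that layout (all names here are hypothetical, and the
one-lock-per-8-buckets reading is an interpretation; the real code is
LRUCache.java and may differ):

import java.util.concurrent.locks.ReentrantLock;

// Rough sketch of the bucket/lock layout described above; class and
// field names are hypothetical, not the actual LRUCache.java code.
class StripedBucketTable
{
    private static final int BUCKETS_PER_LOCK = 8;

    private final Object[] buckets;       // hash chains; power-of-two count
    private final ReentrantLock[] locks;  // lock striping across the buckets

    StripedBucketTable( int maxCacheEntries )
    {
        // largest power of two strictly below the max number of cache entries
        int numBuckets = Integer.highestOneBit( Math.max( maxCacheEntries - 1, 1 ) );
        buckets = new Object[numBuckets];
        locks = new ReentrantLock[Math.max( numBuckets / BUCKETS_PER_LOCK, 1 )];

        for ( int i = 0; i < locks.length; i++ )
        {
            locks[i] = new ReentrantLock();
        }
    }

    ReentrantLock lockFor( long recordId )
    {
        int bucket = ( int ) ( recordId & ( buckets.length - 1 ) ); // cheap modulo for a power of two
        return locks[( bucket / BUCKETS_PER_LOCK ) % locks.length];
    }
}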

*The cache replacement policy is LRU. There are 16 LRU lists, and each
cache entry is assigned to one of them. Each LRU list is protected by a
separate lock. LRU replacement is supposed to be fast. Threads choose
an LRU list based on a randomizer. Since replacement is supposed to be
fast and each thread randomly chooses an LRU list to replace from, LRU
operations should not be a bottleneck.
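
A minimal sketch of the striping idea, with a plain Deque standing in
for the real list (hypothetical names; not the actual code):

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Random;
import java.util.concurrent.locks.ReentrantLock;

// Sketch of the striped LRU described above: 16 independent LRU lists,
// each under its own lock, with threads picking one at random.
class StripedLru<E>
{
    private static final int NUM_LRUS = 16;

    private final Deque<E>[] lists;
    private final ReentrantLock[] locks;
    private final Random randomizer = new Random();

    @SuppressWarnings("unchecked")
    StripedLru()
    {
        lists = new Deque[NUM_LRUS];
        locks = new ReentrantLock[NUM_LRUS];

        for ( int i = 0; i < NUM_LRUS; i++ )
        {
            lists[i] = new ArrayDeque<E>();
            locks[i] = new ReentrantLock();
        }
    }

    // Evict a victim from a randomly chosen list; contention spreads
    // across 16 locks instead of one global LRU lock. May return null
    // if the chosen list happens to be empty.
    E replaceOne()
    {
        int idx = randomizer.nextInt( NUM_LRUS );
        locks[idx].lock();

        try
        {
            return lists[idx].pollLast(); // tail = least recently used, by convention here
        }
        finally
        {
            locks[idx].unlock();
        }
    }
}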

* Each cache entry has a [startVersion, endVersion) interval in which
it is valid. At any time, a hash bucket chain looks like this:

 (key1, startVersion11, endVersion11) <-> (key2, startVersion21, endVersion21) <-> ...
                  |                                         |
 (key1, startVersion12, endVersion12)     (key2, startVersion22, endVersion22)
                  |                                         |
                 ...                                       ...

that is, there is a primary chain where entries for different keys are
chained, and then there is a subchain where different versions of a
given key are held. So, when readers search for a (key, version), they
first walk the primary chain and then walk the subchain to find their
entry.
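
A reader's lookup over this two-level chain might look roughly like the
following (types and field names are hypothetical stand-ins for the
structures in LRUCache.java):

// Hypothetical stand-in for a cache entry on the two-level chain above.
class CacheEntry
{
    long key;
    long startVersion;
    long endVersion;          // exclusive: entry is valid for [startVersion, endVersion)
    Object value;
    CacheEntry nextKey;       // primary chain: next distinct key in this bucket
    CacheEntry olderVersion;  // subchain: older versions of the same key
}

class VersionedLookup
{
    // Walk the primary chain to the key, then the subchain to the
    // version visible to a reader that started at readVersion.
    static Object find( CacheEntry bucketHead, long key, long readVersion )
    {
        for ( CacheEntry e = bucketHead; e != null; e = e.nextKey )
        {
            if ( e.key != key )
            {
                continue;
            }

            for ( CacheEntry v = e; v != null; v = v.olderVersion )
            {
                if ( v.startVersion <= readVersion && readVersion < v.endVersion )
                {
                    return v.value;
                }
            }

            return null; // key found, but no version visible at readVersion
        }

        return null; // key not in this bucket
    }
}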

*As writes create previous versions of entries, they use part of the
cache to store them. The rule is that such an entry cannot be replaced
as long as there might be a reader that could still read it. We keep
track of the minimum read action version to decide when such entries
become replaceable.
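
Reusing the hypothetical CacheEntry from the sketch above, the rule
might be checked along these lines (again a sketch, not the actual
bookkeeping in LRUCache.java and ActionVersioning):

class Replacement
{
    // A superseded version may be evicted only once no active read
    // action could still need it. Current versions are assumed here to
    // carry endVersion == Long.MAX_VALUE and follow normal LRU eviction.
    static boolean isReplaceable( CacheEntry entry, long minActiveReadVersion )
    {
        return entry.endVersion <= minActiveReadVersion;
    }
}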

*If there are long browse operations and quite a bit of updating going
on at the same time, we might run into a case where most of the cache
entries are used to store previous versions. We might even have a case
where all entries store previous versions and none of them can be
replaced (because of the rule above). In this case, writers wait for a
freeable cache entry. When a reader cannot find a replaceable entry, it
reads from disk while holding the bucket latch (thus blocking any
possible writer on the same location) and returns the entry to the user
without populating the cache, and thus without looking for a
replaceable cache entry. Since readers always make progress, the min
read version will eventually advance and writers will progress too.
Normally, when readers or writers do IO, they release the hash latch.

* There are some helper classes that the LRUCache needs to work. Maybe
the most interesting one is ActionVersioning, which uses AtomicInteger
and AtomicReference and is optimized for the read-mostly case. We also
have ExplicitList, where remove operations are O(1) given an element
(in contrast to the O(n) remove given an element on Java's lists). Such
fast removes are needed for the LRU algorithm.
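
The trick behind an O(1) remove is an intrusive list: each element
carries its own links, so removal needs no search. A sketch of the idea
(hypothetical; the real ExplicitList may differ):

// Intrusive doubly-linked list with O(1) remove given the element
// itself; a circular sentinel avoids null checks at the ends.
class IntrusiveList<E>
{
    static class Link<E>
    {
        E item;
        Link<E> prev;
        Link<E> next;
    }

    private final Link<E> head = new Link<E>(); // sentinel node

    IntrusiveList()
    {
        head.prev = head;
        head.next = head;
    }

    void addFirst( Link<E> link )
    {
        link.next = head.next;
        link.prev = head;
        head.next.prev = link;
        head.next = link;
    }

    // O(1): the caller hands us the link, so there is nothing to search for.
    void remove( Link<E> link )
    {
        link.prev.next = link.next;
        link.next.prev = link.prev;
        link.prev = null;
        link.next = null;
    }
}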

*When (key, value) pairs are added to the Btree or when they are
retrieved, the Btree does a deep copy of the value (through
serialization/deserialization). This is needed so that the Btree can
store previous versions of values. I assumed keys stored in Btrees are
not changed. If they were, even the CacheRecordManager currently in use
wouldn't work.
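
The deep copy itself is just a serialize/deserialize round trip; here
is a sketch using plain Java serialization (the real code presumably
goes through the btree's own serializers):

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;

class DeepCopy
{
    // Deep copy via a serialize/deserialize round trip; the copy shares
    // no mutable state with the original object.
    static Object deepCopy( Object value ) throws IOException, ClassNotFoundException
    {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        ObjectOutputStream out = new ObjectOutputStream( bos );
        out.writeObject( value );
        out.close();

        ObjectInputStream in = new ObjectInputStream( new ByteArrayInputStream( bos.toByteArray() ) );
        return in.readObject();
    }
}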

4) Possible improvements:
*If most of the cache entries are used to store previous versions,
cache effectiveness will decrease. A solution is to start spilling
previous versions to disk when this happens. The subchain we talked
about above would have to be spilled to disk. However, this is only a
performance problem, and a corner case at that, if it is true that LDAP
is read-mostly.

* Currently, when a write action hits an IO exception, the action is
aborted and I do not advance the read version, so readers do not see
the effects of the aborted action. However, it seems that the upper
layers do not do enough cleanup in this case: they continue using the
jdbm stores, and this will lead to inconsistency. A good thing would be
to roll back all the dirty changes. Also, jdbm txns are not currently
enabled, so a crash in the middle of syncing might leave the store
inconsistent.

5) TODO:
*add some more test cases for the versioned btree to test corner cases.
*I am not very willing to implement disk spilling, since it is only a
performance improvement needed in corner cases if stores are mostly
read-only. But if you guys think it is really necessary, I might look
into this as well.


regards,
Selcuk Aya

Re: status update on jdbm work(was Re: JDBM experimental branch created)

Posted by Emmanuel Lécharny <el...@apache.org>.
On 8/29/11 2:09 PM, Selcuk AYA wrote:
>
>> Right now, I consider that it's just an improvement. What is important atm
>> is to have a working server. That means we need more tests, and some wide
>> ones (ie, with many clients injecting read and write requests, run for some
>> long period of time).
> For testing purposes, do we have a benchmark kind of workload that is
> representative of a typical LDAP workload?

We had one, but it requires a lot of machines that we no longer have.

I'm currently preparing a presentation on LDAP related tests for a 
conference, so I will work on building an environment for measuring the 
performance we get with the different versions we now have.
I'll keep the list updated.


-- 
Regards,
Cordialement,
Emmanuel Lécharny
www.iktek.com


Re: status update on jdbm work(was Re: JDBM experimental branch created)

Posted by Selcuk AYA <ay...@gmail.com>.
On Mon, Aug 29, 2011 at 1:19 PM, Alex Karasulu <ak...@apache.org> wrote:
> Oooops sorry just saw this thread. I responded to the other thread. Sorry
> about this. Will switch to this one.
My bad. I thought my email did not make it, so I sent it multiple times.
>
> On Mon, Aug 29, 2011 at 12:55 PM, Emmanuel Lecharny <el...@gmail.com>
> wrote:
>>
>> Hi Selcuk,
>>
>> On 8/29/11 10:52 AM, Selcuk AYA wrote:
>>>
>>> Hi, I just attached my latest changes for the jdbm branch and wanted to
>>> give a status update and some technical details:
>>
>> Many thanks !
>>>
>>> 1) Summary
>>> *We now have a jdbm tree which treats find, insert, remove and browse
>>> as actions that execute in isolation. In particular, read actions will
>>> not be affected by ongoing structural changes to the tree and will
>>> only see data changes that completed before they started.
>>
>> Just great !
>>>
>>> *We allow one writer and multiple readers to execute concurrently.
>>> Synchronized operations are mostly removed.
>>
>> Woot ! We need to do some perf tests to see what kind of performances
>> improvement it brings...
>>>
>>> * Existing tests (except StoredProcedureIT) and the unit tests I
>>> added for the versioned tree pass (I ran mvn clean install
>>> -Dintegration). I think the problem with StoredProcedureIT is an
>>> existing one. There is a piece of code where I serialize and
>>> deserialize tuple values stored in the JDBM btree in order to do a
>>> deep copy. With StoredProcedureIT, deserialization of the hello
>>> world stored procedure throws a UTFDataFormatException. On a clean
>>> branch, I added similar code to deserialize B+ tree page values
>>> right after they are serialized, and I hit the same issue. So I
>>> think this is an existing issue with stored procedure
>>> serialization/deserialization.
>>
>> I'm going to review the ignored test.
>>>
>>> 2) Changes above JDBM level
>>> * I added changes to call the newly added browser->close() interface
>>> when cursors are closed or a cursor is repositioned.
>>> * I hit some existing issues where cursors are not closed. In
>>> particular, I had to change SubentryInterceptor.java to close the
>>> cursor after search operations and change the JDBM container cursor
>>> to close the contained cursor when it is closed. If required, I can
>>> provide these changes as separate fixes.
>>
>> Yeah, just create a new JIRA for this one, and attach new patches.
Will do.

>>>
>>> 3) Technical details at JDBM level:
>>>
>>> *The core functionality is in LRUCache.java, which implements a
>>> concurrent, versioned cache. There is a power-of-two number of hash
>>> buckets, with one lock covering every 8 buckets (lock striping). The
>>> number of hash buckets is x, where x is the largest power of two
>>> such that x < max number of cache entries.
>>
>> Ok.
>>>
>>> *The cache replacement policy is LRU. There are 16 LRU lists, and
>>> each cache entry is assigned to one of them. Each LRU list is
>>> protected by a separate lock. LRU replacement is supposed to be
>>> fast. Threads choose an LRU list based on a randomizer. Since
>>> replacement is supposed to be fast and each thread randomly chooses
>>> an LRU list to replace from, LRU operations should not be a
>>> bottleneck.
>>
>> Do we need to make this number (16) configurable ?
Not needed, I think. Even big SMP machines have around 32 CPUs.

>>>
>>> * Each cache entry has a [startVersion, endVersion) interval in
>>> which it is valid. At any time, a hash bucket chain looks like this:
>>>
>>>  (key1, startVersion11, endVersion11) <-> (key2, startVersion21, endVersion21) <-> ...
>>>                   |                                         |
>>>  (key1, startVersion12, endVersion12)     (key2, startVersion22, endVersion22)
>>>                   |                                         |
>>>                  ...                                       ...
>>>
>>> that is, there is a primary chain where entries for different keys
>>> are chained, and then there is a subchain where different versions
>>> of a given key are held. So, when readers search for a (key,
>>> version), they first walk the primary chain and then walk the
>>> subchain to find their entry.
>>
>> The diagram is not rendered correctly in the mail, but it's not a big deal.
>> I'm not sure I grok your explanation though. I need to re-read it later...
>>>
>>> *As writes create previous versions of entries,
>>
>> You mean 'new versions', right ? Or maybe what you mean is that once you
>> modify an existing entry, the previous version of this entry is stored into
>> the cache ?
>>>
As writes change an existing entry, they create a newer version of it
and store the older version in the cache.

>>> they use part of the cache to store them. The rule is that such an
>>> entry cannot be replaced as long as there might be a reader that
>>> could still read it. We keep track of the minimum read action
>>> version to decide when such entries become replaceable.
>>>
>>> *If there are long browse operations and quite a bit of updating
>>> going on at the same time, we might run into a case where most of
>>> the cache entries are used to store previous versions.
>>
>> Ok.
>>>
>>> We might even have a case where all entries store previous versions
>>> and none of them can be replaced (because of the rule above). In
>>> this case, writers wait for a freeable cache entry.
>>
>> Ok. We could probably choose to keep the old versions on disk, to avoid
>> having to hang a write, but for a first version, I think it's an acceptable
>> solution.
>>>
>>> When a reader cannot find a replaceable entry, it reads from disk
>>> while holding the bucket latch (thus blocking any possible writer on
>>> the same location) and returns the entry to the user without
>>> populating the cache, and thus without looking for a replaceable
>>> cache entry. Since readers always make progress, the min read
>>> version will eventually advance and writers will progress too.
>>> Normally, when readers or writers do IO, they release the hash
>>> latch.
>>
>> Ok.
>>>
>>> * There are some helper classes that the LRUCache needs to work.
>>> Maybe the most interesting one is ActionVersioning, which uses
>>> AtomicInteger and AtomicReference and is optimized for the
>>> read-mostly case. We also have ExplicitList, where remove operations
>>> are O(1) given an element (in contrast to the O(n) remove given an
>>> element on Java's lists). Such fast removes are needed for the LRU
>>> algorithm.
>>
>> Ok
>>>
>>> *When (key, value) pairs are added to the Btree or when they are
>>> retrieved, the Btree does a deep copy of the value (through
>>> serialization/deserialization). This is needed so that the Btree can
>>> store previous versions of values. I assumed keys stored in Btrees
>>> are not changed. If they were, even the CacheRecordManager currently
>>> in use wouldn't work.
>>
>> Do you mean we never delete anything from the BTree ?

JDBM B-tree pages hold references to the values and keys provided by
the upper layers while the page is in memory. Say I had page B1 with
the key, value pair (key1, value1). Now some writer does a fetch for
key1 and modifies value1 to value2. Since B1 held a reference to the
value, its content changed to (key1, value2) as well. Now the writer
does a put to the Btree layer, and the Btree layer copies page B1 on
write, hoping to store the previous version of the page and the value,
but the value for key1 has already been modified to value2. To avoid
such situations, we give the writer a copy of value1 and let it modify
the copy; hence the deep copy we do through
serialization/deserialization.
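
To make the hazard concrete, here is a toy example where a HashMap
stands in for the in-memory page (everything here is hypothetical):

import java.util.HashMap;
import java.util.Map;

// Toy demonstration of the aliasing hazard: if the page keeps a live
// reference to the caller's value, mutating that value silently
// rewrites what we hoped was the previous version.
public class AliasingHazard
{
    public static void main( String[] args )
    {
        Map<String, StringBuilder> pageB1 = new HashMap<String, StringBuilder>();

        StringBuilder value1 = new StringBuilder( "value1" );
        pageB1.put( "key1", value1 ); // page B1 now references value1 directly

        // A writer fetches key1 and modifies the value in place...
        StringBuilder fetched = pageB1.get( "key1" );
        fetched.replace( 0, fetched.length(), "value2" );

        // ...and the "old" page content changed out from under us.
        System.out.println( pageB1.get( "key1" ) ); // prints "value2", not "value1"
    }
}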

>>>
>>> 4) Possible improvements:
>>> *If most of the cache entries are used to store previous versions,
>>> cache effectiveness will decrease. A solution is to start spilling
>>> previous versions to disk when this happens. The subchain we talked
>>> about above would have to be spilled to disk. However, this is only
>>> a performance problem, and a corner case at that, if it is true that
>>> LDAP is read-mostly.
>>
>> Yes. See one of my comments above.
As Alex suggested, it might be good to first see how this thing
behaves with a typical LDAP workload.

>>>
>>> * Currently, when a write action hits an IO exception, the action is
>>> aborted and I do not advance the read version, so readers do not see
>>> the effects of the aborted action. However, it seems that the upper
>>> layers do not do enough cleanup in this case: they continue using
>>> the jdbm stores, and this will lead to inconsistency. A good thing
>>> would be to roll back all the dirty changes. Also, jdbm txns are not
>>> currently enabled, so a crash in the middle of syncing might leave
>>> the store inconsistent.
>>
>> I guess we have to address those issues. Crashes can occur when we get a
>> full file-system, or a defective disk. In both cases, I think having a crash
>> recovery system should be enough, as a first approach. In any case, I guess
>> that the server has to be restarted when it occurs, so that some corrective
>> actions can be done before the backend serves any more requests...

We should consider enabling JDBM transactions, I think.

>>>
>>> 5) TODO:
>>> *add some more test cases for the versioned btree to test corner
>>> cases.
>>> *I am not very willing to implement disk spilling, since it is only
>>> a performance improvement needed in corner cases if stores are
>>> mostly read-only. But if you guys think it is really necessary, I
>>> might look into this as well.
>>
>> Right now, I consider that it's just an improvement. What is important atm
>> is to have a working server. That means we need more tests, and some wide
>> ones (ie, with many clients injecting read and write requests, run for some
>> long period of time).

For testing purposes, do we have a benchmark kind of workload that is
representative of a typical LDAP workload?

>>
>> Selcuk, this is excellent work ! I'm going to apply your last changes
>> to the repo. Many thanks !
>>
>>
>> --
>> Regards,
>> Cordialement,
>> Emmanuel Lécharny
>> www.iktek.com
>>
>
>
>
> --
> Best Regards,
> -- Alex
>
