Posted to dev@directory.apache.org by Emmanuel Lécharny <el...@gmail.com> on 2018/02/11 17:43:33 UTC

[ApacheDS] Transaction support

Hi guys,

a quick heads-up about this on-going effort.

First of all, I added the Transaction extended operation in the LDAP
API, but this is somewhat orthogonal. We don't really need it at the
moment in the server, but we will most certainly leverage it later for
some interesting features (see below).

At the moment, the idea is to add cross-B-tree transactions to
partitions (at least the JDBM/Mavibot partitions). This is critical
because it will fix the corruption problem we have.

The idea is to start a transaction in the OperationManager, either read
or write depending on the operation. We can have many read operations
going on, but only one write operation (for the JDBM partition, we will
have some more constraints).

Transactions have to be started by partitions, as the upper layer (ie
the OperationManager) has no way to know how the lower level (ie the
partitions) deals with transactions. This is possible because we can
determine which partition we are addressing using the operation's DN.
The first thing to do is to move this part from the partitions to the
OperationManager.
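
A minimal sketch of that dispatch (beginReadTransaction()/
beginWriteTransaction() and setTransaction() match the names used later
in this thread; opContext and isWriteOperation are placeholders) :

    // Illustrative sketch : resolve the partition from the operation's DN,
    // then ask that partition for the right kind of transaction
    Partition partition = directoryService.getPartitionNexus().getPartition( dn );
    opContext.setPartition( partition );

    PartitionTxn txn = isWriteOperation
        ? partition.beginWriteTransaction()  // exclusive : one writer at a time
        : partition.beginReadTransaction();  // shared : many concurrent readers
    opContext.setTransaction( txn );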

Then we have to propagate the txn down to the nexus. The easiest way is
to pass the txn into the OperationContext instances. Once we have gone
through the interceptors down to the Nexus partition, we have to apply
the operation to the specific partition, and this is done mostly by the
AbstractBtreePartition. Suffice it to say that each B-tree update
needs to know about the txn, so we have to modify the basic partition
operations to take an extra parameter : the txn. And this is where it
starts to get hard... Because that implies we also have to extend the
following interfaces :

- Table
- Index
- Cursor

so that they also take this txn as a parameter.
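
To give an idea of the shape, the extended Table interface could look
like the sketch below (the exact signatures are assumptions; the extra
txn parameter is the point) :

    // Sketch : every Table access now happens within an explicit transaction
    public interface Table<K, V>
    {
        // Read a value within the given transaction
        V get( PartitionTxn txn, K key ) throws LdapException;

        // Update the table within the given transaction
        void put( PartitionTxn txn, K key, V value ) throws LdapException;

        void remove( PartitionTxn txn, K key ) throws LdapException;

        // A cursor is bound to the transaction it was opened in
        Cursor<Tuple<K, V>> cursor( PartitionTxn txn ) throws LdapException;
    }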

It's a bit of a gigantic change in the interfaces... Note that we don't
want to change the LDAP API Cursor interface either.

Then we have to change the way the JDBM partition behaves. Currently,
we create a RecordManager for each B-tree (ie, each Table, which may
have 2 B-trees, the forward and the reverse index). This is not good,
because we can't apply a global operation across many RecordManagers
using a txn. The JDBM TransactionManager is applied to a single
RecordManager. So the next big change is to have JDBM working with one
single file. That impacts the initialisation of the JDBM indexes and
partition.
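
For reference, the single-file layout maps naturally to the stock JDBM
API (a sketch; the file and index names are made up, and exception
handling is elided) :

    // One RecordManager - hence one backing file - for the whole partition
    RecordManager recordManager =
        RecordManagerFactory.createRecordManager( "example-partition" );

    // Every B-tree (master table and each index) lives in that same file,
    // registered under a name so it can be reloaded at restart
    BTree masterTable = BTree.createInstance( recordManager, comparator );
    recordManager.setNamedObject( "master", masterTable.getRecid() );

    BTree cnForward = BTree.createInstance( recordManager, comparator );
    recordManager.setNamedObject( "cn.forward", cnForward.getRecid() );

    // A commit is now global to every B-tree of the partition
    recordManager.commit();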

For Mavibot, it's simpler, because we already use one single file anyway.

ATM, I have made the required changes in the Partition/Index/Cursor to
pass this extra txn parameter, and I'm dealing with the single-file
layout. The LDIF Partition is working just fine with all those
changes, and many of the JDBM tests are also running green.

The last thing to do is for JDBM to make sure we don't have a collision
between reads and writes (ie we can't read while we write). That will
slow down the server when we do a write, but JDBM won't be able to deal
with concurrent reads and writes anyway. Mavibot does not have such a
limitation, so I expect that Mavibot will become the de facto backend
soon after those changes.
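
In the meantime, a plain read-write lock in front of the JDBM partition
is enough to enforce that exclusion (a sketch, using
java.util.concurrent.locks; lockWrite()/unlockWrite() mirror the method
names that show up later in this thread) :

    // Sketch : readers may run concurrently, a writer runs alone
    private final ReadWriteLock rwLock = new ReentrantReadWriteLock();

    private void lockRead()
    {
        rwLock.readLock().lock();    // shared : many readers at once
    }

    private void unlockRead()
    {
        rwLock.readLock().unlock();
    }

    private void lockWrite()
    {
        rwLock.writeLock().lock();   // exclusive : blocks all readers
    }

    private void unlockWrite()
    {
        rwLock.writeLock().unlock();
    }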

Last but not least, being able to leverage the Transaction extended
operation will allow fast loading of the server, especially during the
initial injection of data: we can do that in memory, and flush the
result globally.
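
From the client side, that bulk load could look like the sketch below
(hedged : the request and control class names are assumptions following
the RFC 5805 vocabulary, not a confirmed API) :

    // Sketch : start a transaction and get its identifier back
    // (StartTransactionRequestImpl/StartTransactionResponse are assumed names)
    StartTransactionResponse startResponse = ( StartTransactionResponse )
        connection.extended( new StartTransactionRequestImpl() );
    byte[] txnId = startResponse.getTransactionId();

    // Attach the RFC 5805 Transaction Specification control to each update
    // (TransactionSpecificationImpl is an assumed name as well)
    for ( Entry entry : entries )
    {
        AddRequest addRequest = new AddRequestImpl().setEntry( entry );
        addRequest.addControl( new TransactionSpecificationImpl( txnId ) );
        connection.add( addRequest );
    }

    // End the transaction with commit=true : the server can apply everything
    // in memory and flush the result in one shot
    EndTransactionRequestImpl endRequest = new EndTransactionRequestImpl();
    endRequest.setTransactionId( txnId );
    endRequest.setCommit( true );
    connection.extended( endRequest );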

I expect all those changes to take a couple of weeks (working on
evenings and week-ends).



I'll keep you posted anyway.

-- 
Emmanuel Lecharny

Symas.com
directory.apache.org


Re: Partly done : [ApacheDS] Transaction support

Posted by Lucas Theisen <lu...@pastdev.com>.
Did you have something specific in mind?  Or were you just thinking I could
run the new version in my device environment and look for failures?  I'll
do what I can in either case...  Let me know what you were thinking.

On Feb 26, 2018 9:19 AM, "Emmanuel Lécharny" <el...@gmail.com> wrote:

>
>
> Le 26/02/2018 à 14:42, Lucas Theisen a écrit :
> > This is excellent news.  Great work on this, I can't wait to be able to
> use
> > an instance without worrying about corruption.  Thank you for your work
> on
> > this!
>
> Thanks Lucas !
>
> Actually, I'll need some 'guinea pigs' to check that corruption does not
> occur. That means some heavy tests with concurrent writes and searches.
>
> Will you be able to participate ?
>
> Thanks !
>
> --
> Emmanuel Lecharny
>
> Symas.com
> directory.apache.org
>
>

Re: Partly done : [ApacheDS] Transaction support

Posted by Emmanuel Lécharny <el...@gmail.com>.

Le 26/02/2018 à 14:42, Lucas Theisen a écrit :
> This is excellent news.  Great work on this, I can't wait to be able to use
> an instance without worrying about corruption.  Thank you for your work on
> this!

Thanks Lucas !

Actually, I'll need some 'guinea pigs' to check that corruption does not
occur. That means some heavy tests with concurrent writes and searches.

Will you be able to participate ?

Thanks !

-- 
Emmanuel Lecharny

Symas.com
directory.apache.org


Re: Partly done : [ApacheDS] Transaction support

Posted by Lucas Theisen <lu...@pastdev.com>.
This is excellent news.  Great work on this, I can't wait to be able to use
an instance without worrying about corruption.  Thank you for your work on
this!

On Feb 26, 2018 5:27 AM, "Emmanuel Lecharny" <el...@apache.org> wrote:

> Hi guys,
>
> I'm done with the changes in the server : all the tests are now passing
> green.
>
> That means we have a JDBM-based server with operations being atomic. That
> should solve the database corruption we have.
>
> In the process, I also changed the way the API handles requests/responses,
> modifying the operation Future, removing queues that were used except for
> operations that could take more than one response (Bind, Search and
> ExtendedOperation). We now use a simple wait/notify solution that works
> well. The Cursor implementation has also been slightly modified to remove
> the monitor in it, as it was not used. Last but not least, the operation
> Future.isDone() method now returns 'true' if we have got an operation
> response. This is important for async implementations as it's a non-blocking
> way to know if we can go on and fetch the operation (if it's not abandoned,
> of course).
>
> Otherwise, I also have removed a useless feature in the JDBM Partition :
> we were storing the number of elements in a B-tree beside the B-tree, which
> is totally spurious as the B-tree already knows about this number : there
> is a btree.size() method. This saves a few writes.
>
> Overall, as I said, the server is 20% faster, and this is due to saved
> writes and to changes in the way the LDAP API handles values.
>
> I still have some cleanup to do, but I think we are on the verge of a
> release of the API and the server, which will also lead to a release of
> Studio :-)
>
> On 2018/02/21 22:01:38, Emmanuel Lecharny <el...@apache.org> wrote:
> > Hi,
> >
> > a quick update on the on-going work (last week-end and last night).
> >
> > Most of the core-integ tests are passing now (except 3 errors and 4
> failures). FTR, there are 335 instances of OperationContext to check in the
> whole code, many of them using a readTransaction (149).
> >
> > I'm facing 2 issues atm :
> > - first, the Search operation must not close the transaction until the
> associated cursor is closed. This is not currently properly done. The best
> solution I foresee is to store the transaction in the Cursor, and to close
> it when the cursor is closed. That would imply we may face some issue if
> the cursor remains open for a long time.
> > - second, as we don't have a file per index anymore, the JdbmPartition
> initialization is now problematic : we are checking the presence of such
> files on disk to see if we have to re-create the missing indexes. IMO, this
> is a wrong thing to do : we should check the configuration and create the
> indexes accordingly if they aren't present in the database. That also
> means we should check in the database, something we currently don't do. This
> is something I have to fix; it's blocking some of the tests (and likely the
> server won't restart either).
> >
> > Otherwise, I have checked all the operations, before the
> OperationManager and after it : the transaction seems to be properly
> initiated and committed or aborted.
> >
> > I'm not far from something that works...
> >
> > More later.
> >
> > On 2018/02/17 17:26:05, Emmanuel Lécharny <el...@gmail.com> wrote:
> > > Hi,
> > >
> > > this is an update on the on-going work.
> > >
> > > I have started to add support of transaction in JDBM first, using the
> > > changes I have applied last week.
> > >
> > > Actually, JDBM supports transactions natively (to some extent), and the
> > > big mistake we made was to limit their use to a per-index/table
> > > level. Which means each index and the master table are updated within
> > > its own transaction, which is clearly not good when we have to make
> > > each LDAP operation atomic.
> > >
> > > So the first big change is that I use one single transaction manager
> > > across all the indexes and the master table : we commit an operation
> > > globally. I do expect that it will make it safer (ie, no concurrent
> > > updates of indexes, leading to some database corruption).
> > >
> > > A typical usage looks like (in OperationManager) :
> > >
> > > public void add( AddOperationContext addContext ) throws LdapException
> > > {
> > >     ...
> > >     // Find the working partition
> > >     Partition partition =
> > > directoryService.getPartitionNexus().getPartition( dn );
> > >     addContext.setPartition( partition );
> > >     ...
> > >
> > >     lockWrite();
> > >
> > >     // Start a Write transaction right away
> > >     PartitionTxn transaction = null;
> > >
> > >     try
> > >     {
> > >         transaction = partition.beginWriteTransaction();
> > >         addContext.setTransaction( transaction );
> > >
> > >         head.add( addContext );
> > >         transaction.commit();
> > >     }
> > >     catch ( LdapException | IOException e )
> > >     {
> > >         try
> > >         {
> > >             if ( transaction != null )
> > >             {
> > >                 transaction.abort();
> > >             }
> > >
> > >             throw new LdapOtherException( e.getMessage(), e );
> > >         }
> > >         catch ( IOException ioe )
> > >         {
> > >             throw new LdapOtherException( ioe.getMessage(), ioe );
> > >         }
> > >     }
> > >     finally
> > >     {
> > >         unlockWrite();
> > >     }
> > >
> > > It sounds a bit convoluted, but what we do is :
> > >
> > > - get the Partition we are working on at this level instead of waiting
> > > to be in the Partition itself. That makes sense because an update
> > > operation (even a move) can't be done across partitions. Note that we
> > > don't fetch the Partition in the Nexus, as we already have it
> > >
> > > - then we ask the selected partition to start a write transaction. What
> > > is interesting here is that not all the partitions will create this
> > > transaction. Typically, the Schema partition won't - it makes no sense
> > > -, nor the config partition : they have in-memory indexes, and the
> > > master table is an LDIF file, it's either properly updated or not.
> > >
> > > - the OperationContext stores the transaction and the partition : the
> > > transaction may be reused by interceptors (typically, we may have to
> > > lookup some other entries, or update entries while processing the
> > > on-going operation)
> > >
> > > - then we call the first interceptor, down to the nexus.
> > >
> > > - If all goes well, the transaction is committed, otherwise, it's
> aborted
> > >
> > >
> > > For JDBM, a commit looks like that :
> > >
> > >     public void commit() throws IOException
> > >     {
> > >         recordManager.commit();
> > >
> > >         // And flush the journal
> > >         BaseRecordManager baseRecordManager = null;
> > >
> > >         if ( recordManager instanceof CacheRecordManager )
> > >         {
> > >             baseRecordManager = ( ( BaseRecordManager ) ( (
> > > CacheRecordManager ) recordManager ).getRecordManager() );
> > >         }
> > >         else
> > >         {
> > >             baseRecordManager = ( ( BaseRecordManager ) recordManager
> );
> > >         }
> > >
> > >
> > >         if ( syncOnWrite )
> > >         {
> > >             baseRecordManager.getTransactionManager().
> synchronizeLog();
> > >         }
> > >     }
> > >
> > > - First the RecordManager is committed, which flushes the updates into
> the
> > > Log file (first disk write). Up to this point, all the updates are done
> > > in memory.
> > >
> > > - then we ask the RecordManager to update the database from the log.
> > >
> > > And that's it.
> > >
> > > Aborting the operation is way simpler :
> > >
> > >     public void abort() throws IOException
> > >     {
> > >         recordManager.rollback();
> > >     }
> > >
> > >
> > > I have removed all the sync() methods called all over the code; they
> are
> > > totally useless now. Bottom line, we end up with one single
> file
> > > storing the data plus a log file containing the on-going updates. If
> the
> > > server crashes, I expect the server to catch up from the journal (still
> > > have to check that aspect).
> > >
> > > Extra benefit : it's *WAY* faster. The perf test I have shows that
> > > adding entries in the server using this code is 2 times faster than
> with
> > > the previous approach (ok, don't get too excited, I'm talking about 180
> > > add/s vs 98 add/s)
> > >
> > >
> > > There is a lot to do still. The Read transactions also need to be
> dealt
> > > with. A typical read looks like this :
> > >
> > >     ...
> > >     LookupOperationContext lookupContext = new LookupOperationContext(
> > >   adminSession, modifyContext.getDn(), "+", "*" );
> > >     lookupContext.setPartition( modifyContext.getPartition() );
> > >     lookupContext.setTransaction( modifyContext.getTransaction() );
> > >
> > >     entry = directoryService.getPartitionNexus().lookup(
> lookupContext );
> > >     ...
> > >
> > > here we reuse the modify operation's transaction.
> > >
> > > Or :
> > >
> > >     LookupOperationContext lookupContext = new LookupOperationContext(
> > >         adminSession, dn );
> > >     lookupContext.setPartition( partition );
> > >
> > >     try ( PartitionTxn partitionTxn = partition.beginReadTransaction()
> )
> > >     {
> > >         lookupContext.setTransaction( partitionTxn );
> > >         adminGroup = nexus.lookup( lookupContext );
> > >     }
> > >     catch ( IOException ioe )
> > >     {
> > >         throw new LdapOtherException( ioe.getMessage(), ioe );
> > >     }
> > >
> > > Here, we create a new Read transaction in a try-with-resources call;
> this
> > > transaction will be automatically closed. Note we don't have to commit
> > > or abort it.
> > >
> > >
> > > So now, I'm fixing the tests, changing all the places in the code where
> > > I need to start a read transaction (around 80 places for the lookup
> > > operation, for instance, then there is search, hasEntry, etc...),
> > > properly encapsulating the update operations. That will take time...
> > >
> > > Once it works for JDBM, I will have to get it to work for Mavibot.
> > >
> > > Still, this is encouraging.
> > >
> > > have a nice week-end.
> > >
> > > --
> > > Emmanuel Lecharny
> > >
> > > Symas.com
> > > directory.apache.org
> > >
> > >
> >
>

Re: Partly done : [ApacheDS] Transaction support

Posted by Emmanuel Lécharny <el...@gmail.com>.
To be clear :

'partly done' means we don't yet support transactions driven by a client
request (aka RFC 5805). This will be the next step, but I think it's not
urgent.

Also we don't have transactions for the Mavibot partition yet, as we
need to move to the newest version of Mavibot, which has not been
released yet. Again, it's not urgent.

That's it for today :-)


Le 26/02/2018 à 11:27, Emmanuel Lecharny a écrit :
> Hi guys,
> 
> I'm done with the changes in the server : all the tests are now passing green.
> 
> That means we have a JDBM-based server with operations being atomic. That should solve the database corruption we have.
> 
> In the process, I also changed the way the API handles requests/responses, modifying the operation Future, removing queues that were used except for operations that could take more than one response (Bind, Search and ExtendedOperation). We now use a simple wait/notify solution that works well. The Cursor implementation has also been slightly modified to remove the monitor in it, as it was not used. Last but not least, the operation Future.isDone() method now returns 'true' if we have got an operation response. This is important for async implementations as it's a non-blocking way to know if we can go on and fetch the operation (if it's not abandoned, of course).
> 
> Otherwise, I also have removed a useless feature in the JDBM Partition : we were storing the number of elements in a B-tree beside the B-tree, which is totally spurious as the B-tree already knows about this number : there is a btree.size() method. This saves a few writes.
> 
> Overall, as I said, the server is 20% faster, and this is due to saved writes and to changes in the way the LDAP API handles values.
> 
> I still have some cleanup to do, but I think we are on the verge of a release of the API and the server, which will also lead to a release of Studio :-)
> 
> On 2018/02/21 22:01:38, Emmanuel Lecharny <el...@apache.org> wrote: 
>> Hi,
>>
>> a quick update on the on-going work (last week-end and last night).
>>
>> Most of the core-integ tests are passing now (except 3 errors and 4 failures). FTR, there are 335 instances of OperationContext to check in the whole code, many of them using a readTransaction (149).
>>
>> I'm facing 2 issues atm :
>> - first, the Search operation must not close the transaction until the associated cursor is closed. This is not currently properly done. The best solution I foresee is to store the transaction in the Cursor, and to close it when the cursor is closed. That would imply we may face some issue if the cursor remains open for a long time.
>> - second, as we don't have a file per index anymore, the JdbmPartition initialization is now problematic : we are checking the presence of such files on disk to see if we have to re-create the missing indexes. IMO, this is a wrong thing to do : we should check the configuration and create the indexes accordingly if they aren't present in the database. That also means we should check in the database, something we currently don't do. This is something I have to fix; it's blocking some of the tests (and likely the server won't restart either).
>>
>> Otherwise, I have checked all the operations, before the OperationManager and after it : the transaction seems to be properly initiated and committed or aborted.
>>
>> I'm not far from something that works...
>>
>> More later.
>>
>> On 2018/02/17 17:26:05, Emmanuel Lécharny <el...@gmail.com> wrote: 
>>> Hi,
>>>
>>> this is an update on the on-going work.
>>>
>>> I have started to add support of transaction in JDBM first, using the
>>> changes I have applied last week.
>>>
>>> Actually, JDBM supports transactions natively (to some extent), and the
>>> big mistake we made was to limit their use to a per-index/table
>>> level. Which means each index and the master table are updated within
>>> its own transaction, which is clearly not good when we have to make each
>>> LDAP operation atomic.
>>>
>>> So the first big change is that I use one single transaction manager
>>> across all the indexes and the master table : we commit an operation
>>> globally. I do expect that it will make it safer (ie, no concurrent
>>> updates of indexes, leading to some database corruption).
>>>
>>> A typical usage looks like (in OperationManager) :
>>>
>>> public void add( AddOperationContext addContext ) throws LdapException
>>> {
>>>     ...
>>>     // Find the working partition
>>>     Partition partition =
>>> directoryService.getPartitionNexus().getPartition( dn );
>>>     addContext.setPartition( partition );
>>>     ...
>>>
>>>     lockWrite();
>>>
>>>     // Start a Write transaction right away
>>>     PartitionTxn transaction = null;
>>>
>>>     try
>>>     {
>>>         transaction = partition.beginWriteTransaction();
>>>         addContext.setTransaction( transaction );
>>>
>>>         head.add( addContext );
>>>         transaction.commit();
>>>     }
>>>     catch ( LdapException | IOException e )
>>>     {
>>>         try
>>>         {
>>>             if ( transaction != null )
>>>             {
>>>                 transaction.abort();
>>>             }
>>>
>>>             throw new LdapOtherException( e.getMessage(), e );
>>>         }
>>>         catch ( IOException ioe )
>>>         {
>>>             throw new LdapOtherException( ioe.getMessage(), ioe );
>>>         }
>>>     }
>>>     finally
>>>     {
>>>         unlockWrite();
>>>     }
>>>
>>> It sounds a bit convoluted, but what we do is :
>>>
>>> - get the Partition we are working on at this level instead of waiting
>>> to be in the Partition itself. That makes sense because an update
>>> operation (even a move) can't be done across partitions. Note that we
>>> don't fetch the Partition in the Nexus, as we already have it
>>>
>>> - then we ask the selected partition to start a write transaction. What
>>> is interesting here is that not all the partitions will create this
>>> transaction. Typically, the Schema partition won't - it makes no sense
>>> -, nor the config partition : they have in-memory indexes, and the
>>> master table is an LDIF file, it's either properly updated or not.
>>>
>>> - the OperationContext stores the transaction and the partition : the
>>> transaction may be reused by interceptors (typically, we may have to
>>> lookup some other entries, or update entries while processing the
>>> on-going operation)
>>>
>>> - then we call the first interceptor, down to the nexus.
>>>
>>> - If all goes well, the transaction is committed, otherwise, it's aborted
>>>
>>>
>>> For JDBM, a commit looks like that :
>>>
>>>     public void commit() throws IOException
>>>     {
>>>         recordManager.commit();
>>>
>>>         // And flush the journal
>>>         BaseRecordManager baseRecordManager = null;
>>>
>>>         if ( recordManager instanceof CacheRecordManager )
>>>         {
>>>             baseRecordManager = ( ( BaseRecordManager ) ( (
>>> CacheRecordManager ) recordManager ).getRecordManager() );
>>>         }
>>>         else
>>>         {
>>>             baseRecordManager = ( ( BaseRecordManager ) recordManager );
>>>         }
>>>
>>>
>>>         if ( syncOnWrite )
>>>         {
>>>             baseRecordManager.getTransactionManager().synchronizeLog();
>>>         }
>>>     }
>>>
>>> - First the RecordManager is committed, which flushes the updates into the
>>> Log file (first disk write). Up to this point, all the updates are done
>>> in memory.
>>>
>>> - then we ask the RecordManager to update the database from the log.
>>>
>>> And that's it.
>>>
>>> Aborting the operation is way simpler :
>>>
>>>     public void abort() throws IOException
>>>     {
>>>         recordManager.rollback();
>>>     }
>>>
>>>
>>> I have removed all the sync() methods called all over the code; they are
>>> totally useless now. Bottom line, we end up with one single file
>>> storing the data plus a log file containing the on-going updates. If the
>>> server crashes, I expect the server to catch up from the journal (still
>>> have to check that aspect).
>>>
>>> Extra benefit : it's *WAY* faster. The perf test I have shows that
>>> adding entries in the server using this code is 2 times faster than with
>>> the previous approach (ok, don't get too excited, I'm talking about 180
>>> add/s vs 98 add/s)
>>>
>>>
>>> There is a lot to do still. The Read transactions also need to be dealt
>>> with. A typical read looks like this :
>>>
>>>     ...
>>>     LookupOperationContext lookupContext = new LookupOperationContext(
>>>   adminSession, modifyContext.getDn(), "+", "*" );
>>>     lookupContext.setPartition( modifyContext.getPartition() );
>>>     lookupContext.setTransaction( modifyContext.getTransaction() );
>>>
>>>     entry = directoryService.getPartitionNexus().lookup( lookupContext );
>>>     ...
>>>
>>> here we reuse the modify operation's transaction.
>>>
>>> Or :
>>>
>>>     LookupOperationContext lookupContext = new LookupOperationContext(
>>>         adminSession, dn );
>>>     lookupContext.setPartition( partition );
>>>
>>>     try ( PartitionTxn partitionTxn = partition.beginReadTransaction() )
>>>     {
>>>         lookupContext.setTransaction( partitionTxn );
>>>         adminGroup = nexus.lookup( lookupContext );
>>>     }
>>>     catch ( IOException ioe )
>>>     {
>>>         throw new LdapOtherException( ioe.getMessage(), ioe );
>>>     }
>>>
>>> Here, we create a new Read transaction in a try-with-resources call; this
>>> transaction will be automatically closed. Note we don't have to commit
>>> or abort it.
>>>
>>>
>>> So now, I'm fixing the tests, changing all the places in the code where
>>> I need to start a read transaction (around 80 places for the lookup
>>> operation, for instance, then there is search, hasEntry, etc...),
>>> properly encapsulating the update operations. That will take time...
>>>
>>> Once it works for JDBM, I will have to get it to work for Mavibot.
>>>
>>> Still, this is encouraging.
>>>
>>> have a nice week-end.
>>>
>>> -- 
>>> Emmanuel Lecharny
>>>
>>> Symas.com
>>> directory.apache.org
>>>
>>>
>>

-- 
Emmanuel Lecharny

Symas.com
directory.apache.org


Partly done : [ApacheDS] Transaction support

Posted by Emmanuel Lecharny <el...@apache.org>.
Hi guys,

I'm done with the changes in the server : all the tests are now passing green.

That means we have a JDBM-based server with operations being atomic. That should solve the database corruption we have.

In the process, I also changed the way the API handles requests/responses, modifying the operation Future, removing queues that were used except for operations that could take more than one response (Bind, Search and ExtendedOperation). We now use a simple wait/notify solution that works well. The Cursor implementation has also been slightly modified to remove the monitor in it, as it was not used. Last but not least, the operation Future.isDone() method now returns 'true' if we have got an operation response. This is important for async implementations as it's a non-blocking way to know if we can go on and fetch the operation (if it's not abandoned, of course).
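
On the client side, that non-blocking check can be used like the sketch below (addAsync()/AddFuture follow the API's async naming; the surrounding details are illustrative and checked exceptions are elided) :

    // Sketch : fire an asynchronous add, then poll isDone() instead of blocking
    AddFuture addFuture = connection.addAsync( addRequest );

    while ( !addFuture.isDone() )
    {
        doSomethingElse();    // free to do other work in the meantime
    }

    // The response has already arrived : this get() won't block
    AddResponse addResponse = addFuture.get();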

Otherwise, I also have removed a useless feature in the JDBM Partition : we were storing the number of elements in a B-tree beside the B-tree, which is totally spurious as the B-tree already knows about this number : there is a btree.size() method. This saves a few writes.

Overall, as I said, the server is 20% faster, and this is due to saved writes and to changes in the way the LDAP API handles values.

I still have some cleanup to do, but I think we are on the verge of a release of the API and the server, which will also lead to a release of Studio :-)

On 2018/02/21 22:01:38, Emmanuel Lecharny <el...@apache.org> wrote: 
> Hi,
> 
> a quick update on the on-going work (last week-end and last night).
> 
> Most of the core-integ tests are passing now (except 3 errors and 4 failures). FTR, there are 335 instances of OperationContext to check in the whole code, many of them using a readTransaction (149).
> 
> I'm facing 2 issues atm :
> - first, the Search operation must not close the transaction until the associated cursor is closed. This is not currently properly done. The best solution I foresee is to store the transaction in the Cursor, and to close it when the cursor is closed. That would imply we may face some issue if the cursor remains open for a long time.
> - second, as we don't have a file per index anymore, the JdbmPartition initialization is now problematic : we are checking the presence of such files on disk to see if we have to re-create the missing indexes. IMO, this is a wrong thing to do : we should check the configuration and create the indexes accordingly if they aren't present in the database. That also means we should check in the database, something we currently don't do. This is something I have to fix; it's blocking some of the tests (and likely the server won't restart either).
> 
> Otherwise, I have checked all the operations, before the OperationManager and after it : the transaction seems to be properly initiated and committed or aborted.
> 
> I'm not far from something that works...
> 
> More later.
> 
> On 2018/02/17 17:26:05, Emmanuel Lécharny <el...@gmail.com> wrote: 
> > Hi,
> > 
> > this is an update on the on-going work.
> > 
> > I have started to add support of transaction in JDBM first, using the
> > changes I have applied last week.
> > 
> > Actually, JDBM supports transactions natively (to some extent), and the
> > big mistake we made was to limit their use to a per-index/table
> > level. Which means each index and the master table are updated within
> > its own transaction, which is clearly not good when we have to make each
> > LDAP operation atomic.
> > 
> > So the first big change is that I use one single transaction manager
> > across all the indexes and the master table : we commit an operation
> > globally. I do expect that it will make it safer (ie, no concurrent
> > updates of indexes, leading to some database corruption).
> > 
> > A typical usage looks like (in OperationManager) :
> > 
> > public void add( AddOperationContext addContext ) throws LdapException
> > {
> >     ...
> >     // Find the working partition
> >     Partition partition =
> > directoryService.getPartitionNexus().getPartition( dn );
> >     addContext.setPartition( partition );
> >     ...
> > 
> >     lockWrite();
> > 
> >     // Start a Write transaction right away
> >     PartitionTxn transaction = null;
> > 
> >     try
> >     {
> >         transaction = partition.beginWriteTransaction();
> >         addContext.setTransaction( transaction );
> > 
> >         head.add( addContext );
> >         transaction.commit();
> >     }
> >     catch ( LdapException | IOException e )
> >     {
> >         try
> >         {
> >             if ( transaction != null )
> >             {
> >                 transaction.abort();
> >             }
> > 
> >             throw new LdapOtherException( e.getMessage(), e );
> >         }
> >         catch ( IOException ioe )
> >         {
> >             throw new LdapOtherException( ioe.getMessage(), ioe );
> >         }
> >     }
> >     finally
> >     {
> >         unlockWrite();
> >     }
> > 
> > It sounds a bit convoluted, but what we do is :
> > 
> > - get the Partition we are working on at this level instead of waiting
> > to be in the Partition itself. That makes sense because an update
> > operation (even a move) can't be done across partitions. Note that we
> > don't fetch the Partition in the Nexus, as we already have it
> > 
> > - then we ask the selected partition to start a write transaction. What
> > is interesting here is that not all the partitions will create this
> > transaction. Typically, the Schema partition won't - it makes no sense
> > -, nor the config partition : they have in-memory indexes, and the
> > master table is an LDIF file, it's either properly updated or not.
> > 
> > - the OperationContext stores the transaction and the partition : the
> > transaction may be reused by interceptors (typically, we may have to
> > lookup some other entries, or update entries while processing the
> > on-going operation)
> > 
> > - then we call the first interceptor, down to the nexus.
> > 
> > - If all goes well, the transaction is committed, otherwise, it's aborted
> > 
> > 
> > For JDBM, a commit looks like that :
> > 
> >     public void commit() throws IOException
> >     {
> >         recordManager.commit();
> > 
> >         // And flush the journal
> >         BaseRecordManager baseRecordManager = null;
> > 
> >         if ( recordManager instanceof CacheRecordManager )
> >         {
> >             baseRecordManager = ( ( BaseRecordManager ) ( (
> > CacheRecordManager ) recordManager ).getRecordManager() );
> >         }
> >         else
> >         {
> >             baseRecordManager = ( ( BaseRecordManager ) recordManager );
> >         }
> > 
> > 
> >         if ( syncOnWrite )
> >         {
> >             baseRecordManager.getTransactionManager().synchronizeLog();
> >         }
> >     }
> > 
> > - First the RecordManager is committed, which flushes the updates into the
> > Log file (first disk write). Up to this point, all the updates are done
> > in memory.
> > 
> > - then we ask the RecordManager to update the database from the log.
> > 
> > And that's it.
> > 
> > Aborting the operation is way simpler :
> > 
> >     public void abort() throws IOException
> >     {
> >         recordManager.rollback();
> >     }
> > 
> > 
> > I have removed all the sync() methods called all over the code; they are
> > totally useless now. Bottom line, we end up with one single file
> > storing the data plus a log file containing the on-going updates. If the
> > server crashes, I expect the server to catch up from the journal (still
> > have to check that aspect).
> > 
> > Extra benefit : it's *WAY* faster. The perf test I have shows that
> > adding entries in the server using this code is 2 times faster than with
> > the previous approach (ok, don't get too excited, I'm talking about 180
> > add/s vs 98 add/s)
> > 
> > 
> > There is a lot to do still. The Read transactions also need to be dealt
> > with. A typical read looks like this :
> > 
> >     ...
> >     LookupOperationContext lookupContext = new LookupOperationContext(
> >   adminSession, modifyContext.getDn(), "+", "*" );
> >     lookupContext.setPartition( modifyContext.getPartition() );
> >     lookupContext.setTransaction( modifyContext.getTransaction() );
> > 
> >     entry = directoryService.getPartitionNexus().lookup( lookupContext );
> >     ...
> > 
> > here we reuse the modify operation's transaction.
> > 
> > Or :
> > 
> >     LookupOperationContext lookupContext = new LookupOperationContext(
> >         adminSession, dn );
> >     lookupContext.setPartition( partition );
> > 
> >     try ( PartitionTxn partitionTxn = partition.beginReadTransaction() )
> >     {
> >         lookupContext.setTransaction( partitionTxn );
> >         adminGroup = nexus.lookup( lookupContext );
> >     }
> >     catch ( IOException ioe )
> >     {
> >         throw new LdapOtherException( ioe.getMessage(), ioe );
> >     }
> > 
> > Here, we create a new Read transaction in a try-with-resources call; this
> > transaction will be automatically closed. Note we don't have to commit
> > or abort it.
> > 
> > 
> > So now, I'm fixing the tests, changing all the places in the code where
> > I need to start a read transaction (around 80 places for the lookup
> > operation, for instance, then there is search, hasEntry, etc...),
> > properly encapsulating the update operations. That will take time...
> > 
> > Once it works for JDBM, I will have to get it to work for Mavibot.
> > 
> > Still, this is encouraging.
> > 
> > have a nice week-end.
> > 
> > -- 
> > Emmanuel Lecharny
> > 
> > Symas.com
> > directory.apache.org
> > 
> > 
> 

Update : [ApacheDS] Transaction support

Posted by Emmanuel Lecharny <el...@apache.org>.
Hi,

a quick update on the on-going work (last week-end and last night).

Most of the core-integ tests are passing now (except 3 errors and 4 failures). FTR, there are 335 instances of OperationContext to check in the whole code, many of them using a readTransaction (149).

I'm facing 2 issues atm :
- first, the Search operation must not close the transaction until the associated cursor is closed. This is not currently properly done. The best solution I foresee is to store the transaction in the Cursor, and to close it when the cursor is closed (see the sketch after this list). That would imply we may face some issues if the cursor remains open for a long time.
- second, as we don't have a file per index anymore, the JdbmPartition initialization is now problematic : we are checking the presence of such files on disk to see if we have to re-create the missing indexes. IMO, this is a wrong thing to do : we should check the configuration and create the indexes accordingly if they aren't present in the database. That also means we should check in the database, something we currently don't do. This is something I have to fix; it's blocking some of the tests (and likely the server won't restart either).
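
For the first issue, a cursor that owns its transaction may be all that's needed (a sketch, assuming PartitionTxn is Closeable, as the try-with-resources usage quoted below suggests) :

    // Sketch : closing the cursor also closes the read transaction it uses
    public class TxnCursorWrapper<E> implements Closeable
    {
        private final Cursor<E> wrapped;
        private final PartitionTxn txn;

        public TxnCursorWrapper( Cursor<E> wrapped, PartitionTxn txn )
        {
            this.wrapped = wrapped;
            this.txn = txn;
        }

        @Override
        public void close() throws IOException
        {
            try
            {
                wrapped.close();
            }
            finally
            {
                // The transaction lives exactly as long as the cursor
                txn.close();
            }
        }
    }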

Otherwise, I have checked all the operations, before the OperationManager and after it : the transaction seems to be properly initiated and committed or aborted.

I'm not far from something that works...

More later.

On 2018/02/17 17:26:05, Emmanuel Lécharny <el...@gmail.com> wrote: 
> Hi,
> 
> this is an update on the on-going work.
> 
> I have started to add support of transaction in JDBM first, using the
> changes I have applied last week.
> 
> Actually, JDBM supports transactions natively (to some extent), and the
> big mistake we made was to limit their use to a per-index/table
> level. Which means each index and the master table are updated within
> its own transaction, which is clearly not good when we have to make each
> LDAP operation atomic.
> 
> So the first big change is that I use one single transaction manager
> across all the indexes and the master table : we commit an operation
> globally. I do expect that it will make it safer (ie, no concurrent
> updates of indexes, leading to some database corruption).
> 
> A typical usage looks like (in OperationManager) :
> 
> public void add( AddOperationContext addContext ) throws LdapException
> {
>     ...
>     // Find the working partition
>     Partition partition =
> directoryService.getPartitionNexus().getPartition( dn );
>     addContext.setPartition( partition );
>     ...
> 
>     lockWrite();
> 
>     // Start a Write transaction right away
>     PartitionTxn transaction = null;
> 
>     try
>     {
>         transaction = partition.beginWriteTransaction();
>         addContext.setTransaction( transaction );
> 
>         head.add( addContext );
>         transaction.commit();
>     }
>     catch ( LdapException | IOException e )
>     {
>         try
>         {
>             if ( transaction != null )
>             {
>                 transaction.abort();
>             }
> 
>             throw new LdapOtherException( e.getMessage(), e );
>         }
>         catch ( IOException ioe )
>         {
>             throw new LdapOtherException( ioe.getMessage(), ioe );
>         }
>     }
>     finally
>     {
>         unlockWrite();
>     }
> 
> It sounds a bit convoluted, but what we do is :
> 
> - get the Partition we are working on at this level instead of waiting
> to be in the Partition itself. That makes sense because an update
> operation (even a move) can't be done across partitions. Note that we
> don't fetch the Partition in the Nexus, as we already have it
> 
> - then we ask the selected partition to start a write transaction. What
> is interesting here is that not all the partitions will create this
> transaction. Typically, the Schema partition won't - it makes no sense
> -, nor the config partition : they have in-memory indexes, and the
> master table is an LDIF file, it's either properly updated or not.
> 
> - the OperationContext stores the transaction and the partition : the
> transaction may be reused by interceptors (typically, we may have to
> lookup some other entries, or update entries while processing the
> on-going operation)
> 
> - then we call the first interceptor, down to the nexus.
> 
> - If all goes well, the transaction is committed, otherwise, it's aborted
> 
> 
> For JDBM, a commit looks like that :
> 
>     public void commit() throws IOException
>     {
>         recordManager.commit();
> 
>         // And flush the journal
>         BaseRecordManager baseRecordManager = null;
> 
>         if ( recordManager instanceof CacheRecordManager )
>         {
>             baseRecordManager = ( ( BaseRecordManager ) ( (
> CacheRecordManager ) recordManager ).getRecordManager() );
>         }
>         else
>         {
>             baseRecordManager = ( ( BaseRecordManager ) recordManager );
>         }
> 
> 
>         if ( syncOnWrite )
>         {
>             baseRecordManager.getTransactionManager().synchronizeLog();
>         }
>     }
> 
> - First the RecordManager is committed, which flushes the updates into the
> Log file (first disk write). Up to this point, all the updates are done
> in memory.
> 
> - then we ask the RecordManager to update the database from the log.
> 
> And that's it.
> 
> Aborting the operation is way simpler :
> 
>     public void abort() throws IOException
>     {
>         recordManager.rollback();
>     }
> 
> 
> I have removed all the sync() methods called all over the code; they are
> totally useless now. Bottom line, we end up with one single file
> storing the data plus a log file containing the on-going updates. If the
> server crashes, I expect the server to catch up from the journal (still
> have to check that aspect).
> 
> Extra benefit : it's *WAY* faster. The perf test I have shows that
> adding entries in the server using this code is 2 times faster than with
> the previous approach (ok, don't get too excited, I'm talking about 180
> add/s vs 98 add/s)
> 
> 
> There is a lot to do still. The Read transactions also need to be dealt
> with. A typical read looks like this :
> 
>     ...
>     LookupOperationContext lookupContext = new LookupOperationContext(
>   adminSession, modifyContext.getDn(), "+", "*" );
>     lookupContext.setPartition( modifyContext.getPartition() );
>     lookupContext.setTransaction( modifyContext.getTransaction() );
> 
>     entry = directoryService.getPartitionNexus().lookup( lookupContext );
>     ...
> 
> here we reuse the modify operation's transaction.
> 
> Or :
> 
>     LookupOperationContext lookupContext = new LookupOperationContext(
>         adminSession, dn );
>     lookupContext.setPartition( partition );
> 
>     try ( PartitionTxn partitionTxn = partition.beginReadTransaction() )
>     {
>         lookupContext.setTransaction( partitionTxn );
>         adminGroup = nexus.lookup( lookupContext );
>     }
>     catch ( IOException ioe )
>     {
>         throw new LdapOtherException( ioe.getMessage(), ioe );
>     }
> 
> Here, we create a new Read transaction in a try-with-resources call; this
> transaction will be automatically closed. Note we don't have to commit
> or abort it.
> 
> 
> So now, I'm fixing the tests, changing all the places in the code where
> I need to start a read transaction (around 80 places for the lookup
> operation, for instance, then there is search, hasEntry, etc...),
> properly encapsulating the update operations. That will take time...
> 
> Once it works for JDBM, I will have to get it to work for Mavibot.
> 
> Still, this is encouraging.
> 
> have a nice week-end.
> 
> -- 
> Emmanuel Lecharny
> 
> Symas.com
> directory.apache.org
> 
> 

Update : [ApacheDS] Transaction support

Posted by Emmanuel Lécharny <el...@gmail.com>.
Hi,

this is an update on the on-going work.

I have started to add support of transaction in JDBM first, using the
changes I have applied last week.

Actually, JDBM supports transactions natively (to some extent), and the
big mistake we made was to limit their use to a per-index/table
level. Which means each index and the master table are updated within
its own transaction, which is clearly not good when we have to make each
LDAP operation atomic.

So the first big change is that I use one single transaction manager
across all the indexes and the master table : we commit an operation
globally. I do expect that it will make it safer (ie, no concurrent
updates of indexes, leading to some database corruption).

A typical usage looks like (in OperationManager) :

public void add( AddOperationContext addContext ) throws LdapException
{
    ...
    // Find the working partition
    Partition partition =
directoryService.getPartitionNexus().getPartition( dn );
    addContext.setPartition( partition );
    ...

    lockWrite();

    // Start a Write transaction right away
    PartitionTxn transaction = null;

    try
    {
        transaction = partition.beginWriteTransaction();
        addContext.setTransaction( transaction );

        head.add( addContext );
        transaction.commit();
    }
    catch ( LdapException | IOException e )
    {
        try
        {
            if ( transaction != null )
            {
                transaction.abort();
            }

            throw new LdapOtherException( e.getMessage(), e );
        }
        catch ( IOException ioe )
        {
            throw new LdapOtherException( ioe.getMessage(), ioe );
        }
    }
    finally
    {
        unlockWrite();
    }
}

It sounds a bit convoluted, but what we do is :

- get the Partition we are working on at this level instead of waiting
to be in the Partition itself. That makes sense because an update
operation (even a move) can't be done across partitions. Note that we
don't fetch the Partition in the Nexus, as we already have it

- then we ask the selected partition to start a write transaction. What
is interesting here is that not all the partitions will create this
transaction. Typically, the Schema partition won't - it makes no sense
-, nor the config partition : they have in-memory indexes, and the
master table is an LDIF file, it's either properly updated or not (see
the no-op sketch after this list).

- the OperationContext stores the transaction and the partition : the
transaction may be reused by interceptors (typically, we may have to
lookup some other entries, or update entries while processing the
on-going operation)

- then we call the first interceptor, down to the nexus.

- If all goes well, the transaction is committed, otherwise, it's aborted
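
For those in-memory partitions, the returned transaction can simply be a
no-op (a sketch, assuming PartitionTxn only exposes the
commit()/abort()/close() used in this post) :

    // Sketch : a do-nothing transaction for partitions that are updated
    // atomically anyway (Schema, config)
    public class NoOpTransaction implements PartitionTxn
    {
        @Override
        public void commit() throws IOException
        {
            // Nothing to flush : the update is already applied
        }

        @Override
        public void abort() throws IOException
        {
            // Nothing to undo
        }

        @Override
        public void close() throws IOException
        {
            // No resource held
        }
    }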


For JDBM, a commit looks like that :

    public void commit() throws IOException
    {
        recordManager.commit();

        // And flush the journal
        BaseRecordManager baseRecordManager = null;

        if ( recordManager instanceof CacheRecordManager )
        {
            baseRecordManager = ( ( BaseRecordManager ) ( (
CacheRecordManager ) recordManager ).getRecordManager() );
        }
        else
        {
            baseRecordManager = ( ( BaseRecordManager ) recordManager );
        }


        if ( syncOnWrite )
        {
            baseRecordManager.getTransactionManager().synchronizeLog();
        }
    }

- First the RecordManager is committed, which flushes the updates into the
Log file (first disk write). Up to this point, all the updates are done
in memory.

- then we ask the RecordManager to update the database from the log.

And that's it.

Aborting the operation is way simpler :

    public void abort() throws IOException
    {
        recordManager.rollback();
    }


I have removed all the sync() methods called all over the code; they are
totally useless now. Bottom line, we end up with one single file
storing the data plus a log file containing the on-going updates. If the
server crashes, I expect the server to catch up from the journal (still
have to check that aspect).

Extra benefit : it's *WAY* faster. The perf test I have shows that
adding entries in the server using this code is 2 times faster than with
the previous approach (ok, don't get too excited, I'm talking about 180
add/s vs 98 add/s)


There is a lot to do still. The Read transactions also need to be dealt
with. A typical read looks like this :

    ...
    LookupOperationContext lookupContext = new LookupOperationContext(
  adminSession, modifyContext.getDn(), "+", "*" );
    lookupContext.setPartition( modifyContext.getPartition() );
    lookupContext.setTransaction( modifyContext.getTransaction() );

    entry = directoryService.getPartitionNexus().lookup( lookupContext );
    ...

here we reuse the modify operation's transaction.

Or :

    LookupOperationContext lookupContext = new LookupOperationContext(
        adminSession, dn );
    lookupContext.setPartition( partition );

    try ( PartitionTxn partitionTxn = partition.beginReadTransaction() )
    {
        lookupContext.setTransaction( partitionTxn );
        adminGroup = nexus.lookup( lookupContext );
    }
    catch ( IOException ioe )
    {
        throw new LdapOtherException( ioe.getMessage(), ioe );
    }

Here, we create a new Read transaction in a try-with-resources call; this
transaction will be automatically closed. Note we don't have to commit
or abort it.


So now, I'm fixing the tests, changing all the places in the code where
I need to start a read transaction (around 80 places for the lookup
operation, for instance, then there is search, hasEntry, etc...),
properly encapsulating the update operations. That will take time...

Once it works for JDBM, I will have to get it to work for Mavibot.

Still, this is encouraging.

have a nice week-end.

-- 
Emmanuel Lecharny

Symas.com
directory.apache.org


Re: [ApacheDS] Transaction support

Posted by Emmanuel Lecharny <el...@apache.org>.
Forgot to mention that I tried to remove the Store interface, which was added a decade ago. I let it be, because it would have impacted the SchemaPartition. So we still have the Store interface around. Well...

On 2018/02/11 17:43:33, Emmanuel Lécharny <el...@gmail.com> wrote: 
> Hi guys,
> 
> a quick heads-up about this on-going effort.
> 
> First of all, I added the Transaction extended operation in the LDAP
> API, but this is somewhat orthogonal. We don't really need it at the
> moment in the server, but we will most certainly leverage it later for
> some interesting features (see below).
> 
> At the moment, the idea is to add cross-B-tree transactions to
> partitions (at least the JDBM/Mavibot partitions). This is critical
> because it will fix the corruption problem we have.
> 
> The idea is to start a transaction in the OperationManager, either read
> or write depending on the operation. We can have many read operations
> going on, but only one write operation (for the JDBM partition, we will
> have some more constraints).
> 
> Transactions have to be started by partitions, as the upper layer (ie
> the OperationManager) has no way to know how the lower level (ie the
> partitions) deals with transactions. This is possible because we can
> determine which partition we are addressing using the operation's DN.
> The first thing to do is to move this part from the partitions to the
> OperationManager.
> 
> Then we have to propagate the txn down to the nexus. The easiest way is
> to pass the txn into the OperationContext instances. Once we have gone
> through the interceptors down to the Nexus partition, we have to apply
> the operation to the specific partition, and this is done mostly by the
> AbstractBtreePartition. Suffice it to say that each B-tree update
> needs to know about the txn, so we have to modify the basic partition
> operations to take an extra parameter : the txn. And this is where it
> starts to get hard... Because that implies we also have to extend the
> following interfaces :
> 
> - Table
> - Index
> - Cursor
> 
> so that they also take this txn as a parameter.
> 
> It's a bit of a gigantic change in the interfaces... Note that we don't
> want to change the LDAP API Cursor interface either.
> 
> Then we have to change the way the JDBM partition behaves. Currently,
> we create a RecordManager for each B-tree (ie, each Table, which may
> have 2 B-trees, the forward and the reverse index). This is not good,
> because we can't apply a global operation across many RecordManagers
> using a txn. The JDBM TransactionManager is applied to a single
> RecordManager. So the next big change is to have JDBM working with one
> single file. That impacts the initialisation of the JDBM indexes and
> partition.
> 
> For Mavibot, it's simpler, because we already use one single file anyway.
> 
> ATM, I have made the required changes in the Partition/Index/Cursor to
> pass this extra txn parameter, and I'm dealing with the single-file
> layout. The LDIF Partition is working just fine with all those
> changes, and many of the JDBM tests are also running green.
> 
> The last thing to do is for JDBM to make sure we don't have a collision
> between reads and writes (ie we can't read while we write). That will
> slow down the server when we do a write, but JDBM won't be able to deal
> with concurrent reads and writes anyway. Mavibot does not have such a
> limitation, so I expect that Mavibot will become the de facto backend
> soon after those changes.
> 
> Last but not least, being able to leverage the Transaction extended
> operation will allow fast loading of the server, especially during the
> initial injection of data: we can do that in memory, and flush the
> result globally.
> 
> I expect all those changes to take a couple of weeks (working on
> evenings and week-ends).
> 
> 
> 
> I'll keep you posted anyway.
> 
> -- 
> Emmanuel Lecharny
> 
> Symas.com
> directory.apache.org
> 
>