Posted to dev@directory.apache.org by Emmanuel Lécharny <el...@gmail.com> on 2012/11/26 16:03:33 UTC

Replication producer potential blocking issue, and a few improvement proposals

Hi,

so I spent the week-end reviewing the replication code in M8, and I
found a few areas that can be improved, and a potentially serious
problem that needs to be fixed.

First of all, the few tests we conducted last week showed that
replication works pretty well, *if* there is no communication loss
between the consumers and the server. We have some issues when a
consumer is disconnected and then reconnects (I still have to
investigate this bug).

Otherwise, there are a few areas for improvement, but this is not urgent :
- the journal does not have to be cleaned up entry by entry. This is an
extremely costly operation, requiring a lot of writes on disk. There is
a better way to manage the old elements : we can simply have a rotating
journal, and keep a current and an old journal. When the current journal
is full, it becomes the old journal, and the old journal is simply deleted.
- I think having one journal instead of one journal per consumer is a
better idea. This is worth discussing for a future implementation, but
right now, I'm fine with the one-consumer/one-journal approach.
- The consumer RID should not be created by the producer. It is only
used by the consumer to distinguish between two different replication
configurations declared on the consumer. The producer does not have to
keep such information locally. That also means we may have more than one
journal for a server, as we may declare more than one replication
consumer from a server A to a server B.
- More critical : the way the EventInterceptor is implemented, we have
no way to check the authorization. We must find a way to go through the
authorization interceptor in order to check that each entry is allowed
to be sent to a consumer.
- Another big issue : we don't filter the AttributeTypes we send to the
consumers, AFAICT
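
To make the rotating journal idea from the first point more concrete,
here is a minimal sketch. It is not the ApacheDS journal code : the
class name RotatingJournal, the maxEntries limit and the use of
in-memory lists in place of on-disk files are all assumptions made for
illustration. The point is only that rotation drops the whole old
generation in one step instead of deleting entries one by one.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical two-generation journal (illustration only). */
class RotatingJournal
{
    private final int maxEntries;

    // In a real journal these would be two files on disk (assumption)
    private List<String> current = new ArrayList<>();
    private List<String> old = new ArrayList<>();

    RotatingJournal( int maxEntries )
    {
        if ( maxEntries <= 0 )
        {
            throw new IllegalArgumentException( "maxEntries must be > 0" );
        }

        this.maxEntries = maxEntries;
    }

    synchronized void append( String record )
    {
        if ( current.size() >= maxEntries )
        {
            // Rotate : the current journal becomes the old one, and the
            // previous old journal is discarded as a whole
            old = current;
            current = new ArrayList<>();
        }

        current.add( record );
    }

    /** Returns the retained records, oldest first. */
    synchronized List<String> readAll()
    {
        List<String> all = new ArrayList<>( old );
        all.addAll( current );

        return all;
    }
}
```

With such a scheme, a consumer that has been away for more than two
journal generations can no longer be served from the journal and would
need a full refresh.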

Now, the problem :
- As we depend on MINA 2 to send the entries to the consumers, we have
to be extremely careful when we do things like :

    private void sendResult( SearchResultEntry searchResultEntry, Entry entry, EventType eventType,
        SyncStateValue syncStateValue )
    {
        searchResultEntry.addControl( syncStateValue );

        LOG.debug( "sending event {} of entry {}", eventType, entry.getDn() );
        WriteFuture future = session.getIoSession().write( searchResultEntry );

        // Now, send the entry to the consumer
        handleWriteFuture( future, entry, eventType );
    }

with :

    private void handleWriteFuture( WriteFuture future, Entry entry, EventType event )
    {
        // Let the operation be executed.
        // Note : we wait 10 seconds max
        future.awaitUninterruptibly( 10000L );

        if ( !future.isWritten() )
        {
            LOG.error( "Failed to write to the consumer {} during the event {} on entry {}", new Object[] {
                           consumerMsgLog.getId(), event, entry.getDn() } );
            LOG.error( "", future.getException() );

            // set realtime push to false, it will be set back to true
            // when the client comes back and sends another request
            pushInRealTime = false;
        }
        else
        {
            try
            {
                // if successful update the last sent CSN
                consumerMsgLog.setLastSentCsn( entry.get( SchemaConstants.ENTRY_CSN_AT ).getString() );
            }
            catch( Exception e )
            {
                //should never happen
                LOG.error( "No entry CSN attribute found", e );
            }
        }
    }

If the consumer is disconnected, the current thread will be blocked for
up to 10 seconds (that is, in the case where the consumer wasn't
gracefully disconnected...). For 10 seconds, the current thread will do
nothing but wait. We don't have hundreds of threads, so at some point
this can become problematic...

The best way to fix that would be to have a separate thread per
consumer, and to use a queue where the events are pushed, a queue that
will be read by the consumer's thread. As we have a queue in the middle,
and a thread per consumer, we can guarantee that handling a modification
is done fast enough on the local server, and propagated efficiently, and
that a consumer's disconnection will be handled without blocking any
of the server's threads.
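
A rough sketch of this queue-plus-thread idea, using only
java.util.concurrent. Everything here is hypothetical : the
ConsumerPusher class, its publish() method and the sender callback are
names invented for the example, and events are plain strings. In the
real code the sender would presumably wrap the MINA write and the
awaitUninterruptibly() call shown above, so the 10 seconds wait would
be paid by the consumer's own thread and not by the server thread.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.function.Predicate;

/** Hypothetical per-consumer pusher (illustration only). */
class ConsumerPusher implements Runnable
{
    private final BlockingQueue<String> queue;
    private final Predicate<String> sender; // returns false if the write failed
    private final Thread thread;
    private volatile boolean pushInRealTime = true;

    ConsumerPusher( String consumerId, int capacity, Predicate<String> sender )
    {
        this.queue = new LinkedBlockingQueue<>( capacity );
        this.sender = sender;
        this.thread = new Thread( this, "repl-pusher-" + consumerId );
        this.thread.setDaemon( true );
    }

    void start()
    {
        thread.start();
    }

    /** Called by the server thread. Never blocks ; false means "not queued". */
    boolean publish( String event )
    {
        return pushInRealTime && queue.offer( event );
    }

    boolean isPushingInRealTime()
    {
        return pushInRealTime;
    }

    public void run()
    {
        try
        {
            while ( pushInRealTime )
            {
                String event = queue.poll( 100, TimeUnit.MILLISECONDS );

                if ( ( event != null ) && !sender.test( event ) )
                {
                    // The consumer is unreachable : stop the realtime push.
                    // Only this thread has been waiting on the failed write.
                    pushInRealTime = false;
                }
            }
        }
        catch ( InterruptedException ie )
        {
            Thread.currentThread().interrupt();
        }
    }

    void stop()
    {
        pushInRealTime = false;
        thread.interrupt();

        try
        {
            thread.join( 1000L );
        }
        catch ( InterruptedException ie )
        {
            Thread.currentThread().interrupt();
        }
    }
}
```

The queue is bounded on purpose : if a consumer is slow, publish()
returns false instead of letting the queue grow without limit, and the
events that were not pushed would still have to be recovered from the
journal when the consumer comes back.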

I'm continuing my investigations !

-- 
Regards,
Cordialement,
Emmanuel Lécharny
www.iktek.com