You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@mina.apache.org by Jason Resch <jr...@cleversafe.com> on 2010/04/03 17:10:34 UTC

Possible Bug in SSLHandler

All,

I think I've discovered a bug in Mina's TLS code.  While I encountered this bug in Mina 1.1.7 it seems the same problem code exists in Mina 2.0.  The code in question is the flushScheduledEvents() method in SSLHandler.java:

public void flushScheduledEvents() {
    // Fire events only when no lock is hold for this handler.
    if (Thread.holdsLock(this)) {
        return;
    }

    Event e;
       
    // We need synchronization here inevitably because filterWrite can be
    // called simultaneously and cause 'bad record MAC' integrity error.
    synchronized (this) {
        while ((e = filterWriteEventQueue.poll()) != null) {
            e.nextFilter.filterWrite(session, (WriteRequest) e.data);
        }
    }

    while ((e = messageReceivedEventQueue.poll()) != null) {
        e.nextFilter.messageReceived(session, e.data);
    }
}

This method is called both by threads which handle writes, and threads that handle reads.  Therefore, as the comments suggest, multiple threads may go through this code simultaneously.  However, since there is no synchronization around processing of the messageReceivedEventQueue, it is possible that the received messages will be sent to the next filter out of order, should there be more than one message in the queue and a context switch happen at the wrong time.

The bug would manifest in our application as a failure of our protocol layer to decode a message, we believe, due to a re-ordering.  It only occurred in environments with a large amount of contention and network traffic and when using TLS.  The fix I have tested was to move the closing brace of the synchronized block to extend to cover both while loops.  I've attached a patch representing that change.  Since making that change we have not encountered the bug again after about 30 hours of testing and 1.5 TB of traffic, whereas before the change we could reproduce it after a few minutes.


Jason

Re: Possible Bug in SSLHandler

Posted by Emmanuel Lécharny <el...@apache.org>.

> Emmanuel Lécharny wrote:
>>> Emmanuel,
>>>
>>> The issue isn't related to any type of corruption of concurrent 
>>> access to the event queue. 
>> Sorry, Jason, I reread your first mal carefully and you are most 
>> certainly right.
>>
>> can you create a JIRA and post your suggested fix ? This is the best 
>> way to get this patch injected and the only way for us to get a track 
>> of it before we release 2.0. Set this problem as critical or even 
>> better, blocker.
>>
>> Many thanks !
>>
>>
> Emmanuel,
>
> It's submitted as: DIRMINA-779, though I don't think its priority is 
> set to blocker..

Ok, thanks !


-- 
Regards,
Cordialement,
Emmanuel Lécharny
www.nextury.com

Re: Possible Bug in SSLHandler

Posted by Jason Resch <jr...@cleversafe.com>.

Emmanuel Lécharny wrote:
>> Emmanuel,
>>
>> The issue isn't related to any type of corruption of concurrent access 
>> to the event queue. 
>>     
> Sorry, Jason, I reread your first mal carefully and you are most 
> certainly right.
>
> can you create a JIRA and post your suggested fix ? This is the best way 
> to get this patch injected and the only way for us to get a track of it 
> before we release 2.0. Set this problem as critical or even better, blocker.
>
> Many thanks !
>
>
>   
Emmanuel,

It's submitted as: DIRMINA-779, though I don't think its priority is set 
to blocker..

Thanks,

Jason

Re: Possible Bug in SSLHandler

Posted by Emmanuel Lécharny <el...@apache.org>.

> Emmanuel,
>
> The issue isn't related to any type of corruption of concurrent access 
> to the event queue. 
Sorry, Jason, I reread your first mal carefully and you are most 
certainly right.

can you create a JIRA and post your suggested fix ? This is the best way 
to get this patch injected and the only way for us to get a track of it 
before we release 2.0. Set this problem as critical or even better, blocker.

Many thanks !


-- 
Regards,
Cordialement,
Emmanuel Lécharny
www.nextury.com

Re: Possible Bug in SSLHandler

Posted by Jason Resch <jr...@cleversafe.com>.

Emmanuel,

The issue isn't related to any type of corruption of concurrent access 
to the event queue.  It is that events may be sent to the next filter in 
the wrong order from the order in which they were received.  For 
instance, the application layer might have a large protocol message 
structure, lets say a few MB.  While SSL deals with small records, 
around a few tens of KB.  While each SSL record is decrypted and put on 
the Queue in the right order, the code in flushScheduleEvents does not 
synchronize around the point when things are removed from the queue and 
sent to the next filter.  Therefore this issue should still exist if the 
latest code in trunk contains this in the flushScheduledEvents() method:

|      while ((e = messageReceivedEventQueue.poll()) != null) {
          e.nextFilter.messageReceived(session, e.data);
      }|

If there is a context switch to another thread right after an event is 
polled from the thread, the next thread will take the next event off the 
queue and send it to the next filter before the first thread does.  That 
while loop must be synchronized.

This is recognized in the above comment for writing, by the fact that 
when writing, improperly ordered data will cause bad record MAC errors, 
but it the same issue applies for processing reads, only it will 
manifest as a protocol decoding error in the higher-level application 
protocol.

Looking at the last code from trunk, the problem code still exists:
http://svn.apache.org/repos/asf/mina/trunk/core/src/main/java/org/apache/mina/filter/ssl/SslHandler.java
Ctrl+F for "flushScheduledEvents"

Jason

Emmanuel Lecharny wrote:
> On 4/3/10 6:17 PM, Bernd Fondermann wrote:
>   
>> Jason,
>>
>> Thanks for your contribution!
>> Would you mind to open a JIRA bug report and attach the diff file to
>> it, checking the ASF license check box?
>>    
>>     
> Not sure this is still an issue in trunk, as the SSL layer has been 
> totally rewritten é months ago (for instance, there is no more 
> ...filter/support/... package). The filterWriteEventQueue data structure 
> is now a ConcurrentLinkedQueue and protected against concurrent accesses.
>
> Can you give the trunk a try ?
>
> Thanks !--
>
> Regards,
> Cordialement,
> Emmanuel Lécharny
> www.nextury.com
>
>
>

Re: Possible Bug in SSLHandler

Posted by Emmanuel Lecharny <el...@gmail.com>.

On 4/3/10 6:17 PM, Bernd Fondermann wrote:
> Jason,
>
> Thanks for your contribution!
> Would you mind to open a JIRA bug report and attach the diff file to
> it, checking the ASF license check box?
>    
Not sure this is still an issue in trunk, as the SSL layer has been 
totally rewritten é months ago (for instance, there is no more 
...filter/support/... package). The filterWriteEventQueue data structure 
is now a ConcurrentLinkedQueue and protected against concurrent accesses.

Can you give the trunk a try ?

Thanks !--

Regards,
Cordialement,
Emmanuel Lécharny
www.nextury.com

Re: Possible Bug in SSLHandler

Posted by Bernd Fondermann <be...@googlemail.com>.

Jason,

Thanks for your contribution!
Would you mind to open a JIRA bug report and attach the diff file to
it, checking the ASF license check box?

Thanks a lot,

  Bernd

On Sat, Apr 3, 2010 at 17:10, Jason Resch <jr...@cleversafe.com> wrote:
> All,
>
> I think I've discovered a bug in Mina's TLS code.  While I encountered this
> bug in Mina 1.1.7 it seems the same problem code exists in Mina 2.0.  The
> code in question is the flushScheduledEvents() method in SSLHandler.java:
>
> public void flushScheduledEvents() {
>   // Fire events only when no lock is hold for this handler.
>   if (Thread.holdsLock(this)) {
>       return;
>   }
>
>   Event e;
>         // We need synchronization here inevitably because filterWrite can
> be
>   // called simultaneously and cause 'bad record MAC' integrity error.
>   synchronized (this) {
>       while ((e = filterWriteEventQueue.poll()) != null) {
>           e.nextFilter.filterWrite(session, (WriteRequest) e.data);
>       }
>   }
>
>   while ((e = messageReceivedEventQueue.poll()) != null) {
>       e.nextFilter.messageReceived(session, e.data);
>   }
> }
>
> This method is called both by threads which handle writes, and threads that
> handle reads.  Therefore, as the comments suggest, multiple threads may go
> through this code simultaneously.  However, since there is no
> synchronization around processing of the messageReceivedEventQueue, it is
> possible that the received messages will be sent to the next filter out of
> order, should there be more than one message in the queue and a context
> switch happen at the wrong time.
>
> The bug would manifest in our application as a failure of our protocol layer
> to decode a message, we believe, due to a re-ordering.  It only occurred in
> environments with a large amount of contention and network traffic and when
> using TLS.  The fix I have tested was to move the closing brace of the
> synchronized block to extend to cover both while loops.  I've attached a
> patch representing that change.  Since making that change we have not
> encountered the bug again after about 30 hours of testing and 1.5 TB of
> traffic, whereas before the change we could reproduce it after a few
> minutes.
>
>
> Jason
>
>