You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@mina.apache.org by Jason Resch <jr...@cleversafe.com> on 2010/04/03 17:10:34 UTC
Possible Bug in SSLHandler
All,
I think I've discovered a bug in Mina's TLS code. While I encountered this bug in Mina 1.1.7 it seems the same problem code exists in Mina 2.0. The code in question is the flushScheduledEvents() method in SSLHandler.java:
public void flushScheduledEvents() {
// Fire events only when no lock is hold for this handler.
if (Thread.holdsLock(this)) {
return;
}
Event e;
// We need synchronization here inevitably because filterWrite can be
// called simultaneously and cause 'bad record MAC' integrity error.
synchronized (this) {
while ((e = filterWriteEventQueue.poll()) != null) {
e.nextFilter.filterWrite(session, (WriteRequest) e.data);
}
}
while ((e = messageReceivedEventQueue.poll()) != null) {
e.nextFilter.messageReceived(session, e.data);
}
}
This method is called both by threads which handle writes, and threads that handle reads. Therefore, as the comments suggest, multiple threads may go through this code simultaneously. However, since there is no synchronization around processing of the messageReceivedEventQueue, it is possible that the received messages will be sent to the next filter out of order, should there be more than one message in the queue and a context switch happen at the wrong time.
The bug would manifest in our application as a failure of our protocol layer to decode a message, we believe, due to a re-ordering. It only occurred in environments with a large amount of contention and network traffic and when using TLS. The fix I have tested was to move the closing brace of the synchronized block to extend to cover both while loops. I've attached a patch representing that change. Since making that change we have not encountered the bug again after about 30 hours of testing and 1.5 TB of traffic, whereas before the change we could reproduce it after a few minutes.
Jason
Re: Possible Bug in SSLHandler
Posted by Emmanuel Lécharny <el...@apache.org>.
> Emmanuel Lécharny wrote:
>>> Emmanuel,
>>>
>>> The issue isn't related to any type of corruption of concurrent
>>> access to the event queue.
>> Sorry, Jason, I reread your first mal carefully and you are most
>> certainly right.
>>
>> can you create a JIRA and post your suggested fix ? This is the best
>> way to get this patch injected and the only way for us to get a track
>> of it before we release 2.0. Set this problem as critical or even
>> better, blocker.
>>
>> Many thanks !
>>
>>
> Emmanuel,
>
> It's submitted as: DIRMINA-779, though I don't think its priority is
> set to blocker..
Ok, thanks !
--
Regards,
Cordialement,
Emmanuel Lécharny
www.nextury.com
Re: Possible Bug in SSLHandler
Posted by Jason Resch <jr...@cleversafe.com>.
Emmanuel Lécharny wrote:
>> Emmanuel,
>>
>> The issue isn't related to any type of corruption of concurrent access
>> to the event queue.
>>
> Sorry, Jason, I reread your first mal carefully and you are most
> certainly right.
>
> can you create a JIRA and post your suggested fix ? This is the best way
> to get this patch injected and the only way for us to get a track of it
> before we release 2.0. Set this problem as critical or even better, blocker.
>
> Many thanks !
>
>
>
Emmanuel,
It's submitted as: DIRMINA-779, though I don't think its priority is set
to blocker..
Thanks,
Jason
Re: Possible Bug in SSLHandler
Posted by Emmanuel Lécharny <el...@apache.org>.
> Emmanuel,
>
> The issue isn't related to any type of corruption of concurrent access
> to the event queue.
Sorry, Jason, I reread your first mal carefully and you are most
certainly right.
can you create a JIRA and post your suggested fix ? This is the best way
to get this patch injected and the only way for us to get a track of it
before we release 2.0. Set this problem as critical or even better, blocker.
Many thanks !
--
Regards,
Cordialement,
Emmanuel Lécharny
www.nextury.com
Re: Possible Bug in SSLHandler
Posted by Jason Resch <jr...@cleversafe.com>.
Emmanuel,
The issue isn't related to any type of corruption of concurrent access
to the event queue. It is that events may be sent to the next filter in
the wrong order from the order in which they were received. For
instance, the application layer might have a large protocol message
structure, lets say a few MB. While SSL deals with small records,
around a few tens of KB. While each SSL record is decrypted and put on
the Queue in the right order, the code in flushScheduleEvents does not
synchronize around the point when things are removed from the queue and
sent to the next filter. Therefore this issue should still exist if the
latest code in trunk contains this in the flushScheduledEvents() method:
| while ((e = messageReceivedEventQueue.poll()) != null) {
e.nextFilter.messageReceived(session, e.data);
}|
If there is a context switch to another thread right after an event is
polled from the thread, the next thread will take the next event off the
queue and send it to the next filter before the first thread does. That
while loop must be synchronized.
This is recognized in the above comment for writing, by the fact that
when writing, improperly ordered data will cause bad record MAC errors,
but it the same issue applies for processing reads, only it will
manifest as a protocol decoding error in the higher-level application
protocol.
Looking at the last code from trunk, the problem code still exists:
http://svn.apache.org/repos/asf/mina/trunk/core/src/main/java/org/apache/mina/filter/ssl/SslHandler.java
Ctrl+F for "flushScheduledEvents"
Jason
Emmanuel Lecharny wrote:
> On 4/3/10 6:17 PM, Bernd Fondermann wrote:
>
>> Jason,
>>
>> Thanks for your contribution!
>> Would you mind to open a JIRA bug report and attach the diff file to
>> it, checking the ASF license check box?
>>
>>
> Not sure this is still an issue in trunk, as the SSL layer has been
> totally rewritten é months ago (for instance, there is no more
> ...filter/support/... package). The filterWriteEventQueue data structure
> is now a ConcurrentLinkedQueue and protected against concurrent accesses.
>
> Can you give the trunk a try ?
>
> Thanks !--
>
> Regards,
> Cordialement,
> Emmanuel Lécharny
> www.nextury.com
>
>
>
Re: Possible Bug in SSLHandler
Posted by Emmanuel Lecharny <el...@gmail.com>.
On 4/3/10 6:17 PM, Bernd Fondermann wrote:
> Jason,
>
> Thanks for your contribution!
> Would you mind to open a JIRA bug report and attach the diff file to
> it, checking the ASF license check box?
>
Not sure this is still an issue in trunk, as the SSL layer has been
totally rewritten é months ago (for instance, there is no more
...filter/support/... package). The filterWriteEventQueue data structure
is now a ConcurrentLinkedQueue and protected against concurrent accesses.
Can you give the trunk a try ?
Thanks !--
Regards,
Cordialement,
Emmanuel Lécharny
www.nextury.com
Re: Possible Bug in SSLHandler
Posted by Bernd Fondermann <be...@googlemail.com>.
Jason,
Thanks for your contribution!
Would you mind to open a JIRA bug report and attach the diff file to
it, checking the ASF license check box?
Thanks a lot,
Bernd
On Sat, Apr 3, 2010 at 17:10, Jason Resch <jr...@cleversafe.com> wrote:
> All,
>
> I think I've discovered a bug in Mina's TLS code. While I encountered this
> bug in Mina 1.1.7 it seems the same problem code exists in Mina 2.0. The
> code in question is the flushScheduledEvents() method in SSLHandler.java:
>
> public void flushScheduledEvents() {
> // Fire events only when no lock is hold for this handler.
> if (Thread.holdsLock(this)) {
> return;
> }
>
> Event e;
> // We need synchronization here inevitably because filterWrite can
> be
> // called simultaneously and cause 'bad record MAC' integrity error.
> synchronized (this) {
> while ((e = filterWriteEventQueue.poll()) != null) {
> e.nextFilter.filterWrite(session, (WriteRequest) e.data);
> }
> }
>
> while ((e = messageReceivedEventQueue.poll()) != null) {
> e.nextFilter.messageReceived(session, e.data);
> }
> }
>
> This method is called both by threads which handle writes, and threads that
> handle reads. Therefore, as the comments suggest, multiple threads may go
> through this code simultaneously. However, since there is no
> synchronization around processing of the messageReceivedEventQueue, it is
> possible that the received messages will be sent to the next filter out of
> order, should there be more than one message in the queue and a context
> switch happen at the wrong time.
>
> The bug would manifest in our application as a failure of our protocol layer
> to decode a message, we believe, due to a re-ordering. It only occurred in
> environments with a large amount of contention and network traffic and when
> using TLS. The fix I have tested was to move the closing brace of the
> synchronized block to extend to cover both while loops. I've attached a
> patch representing that change. Since making that change we have not
> encountered the bug again after about 30 hours of testing and 1.5 TB of
> traffic, whereas before the change we could reproduce it after a few
> minutes.
>
>
> Jason
>
>