You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@trafficserver.apache.org by "Yunkai Zhang (JIRA)" <ji...@apache.org> on 2013/07/30 17:39:51 UTC
[jira] [Comment Edited] (TS-1330) Logging related segfault in 3.2.0

    [ https://issues.apache.org/jira/browse/TS-1330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13723962#comment-13723962 ] 

Yunkai Zhang edited comment on TS-1330 at 7/30/13 3:39 PM:
-----------------------------------------------------------

I have found the root cause of this issue, here is my analysis:

1. One thread(A) got an LogBuffer object in LogObject::_checkout_write() and ran into LogBuffer::checkout_write(), before it updated the LogBuffer::m_state, it was scheduled out by the kernel.

2. Other threads got the same LogBuffer object, but they *all* finished _checkout_write() codes and made the LogBuffer object become *full*, finally one of these thread flushed the LogBuffer object into write_list queue.

3. The LogBuffer object's m_references == 0 at this time.

4. The flushing thread would call LogBufferManager::flush_buffers() to loop write_list. As the LogBuffer's m_references == 0, so it would be flushed and *deleted*.

5. Now, thread(A) would be scheduled in by the kernel, but the LogBuffer object has been *deleted*, so the m_buffer == NULL, that is why entry_header is 0x0.

Here is a coredump generated by the my testing patch - "0001-A-testing-patch-to-verify-my-analysis.patch":
{code}
(gdb) bt
#0  0x0000003c75232885 in raise () from /lib64/libc.so.6
#1  0x0000003c75234065 in abort () from /lib64/libc.so.6
#2  0x00002b02d24bdaf8 in ink_die_die_die (retval=1) at ink_error.cc:43
#3  0x00002b02d24bdbca in ink_fatal_va(int, const char *, typedef __va_list_tag __va_list_tag *) (return_code=1,
    message_format=0x2b02d24dbe40 "%s:%d: failed assert `%s`", ap=0x2b0304211a90) at ink_error.cc:65
#4  0x00002b02d24bdc95 in ink_fatal (return_code=1, message_format=0x2b02d24dbe40 "%s:%d: failed assert `%s`")
    at ink_error.cc:73
#5  0x00002b02d24bcbc4 in _ink_assert (expression=0x74e021 "current_ref == 0", file=0x74e014 "LogObject.cc", line=60)
    at ink_assert.cc:38
#6  0x000000000060a9ab in LogBufferManager::flush_buffers (this=0x3219a08, sink=0x32199a0) at LogObject.cc:60
#7  0x00000000005f755e in LogObject::flush_buffers (this=0x3219970) at LogObject.h:126
#8  0x000000000060de55 in LogObjectManager::flush_buffers (this=0x31e84f0) at LogObject.cc:1185
#9  0x00000000005f6627 in Log::flush_thread_main (args=0x0) at Log.cc:1193
#10 0x00000000005f779b in LoggingFlushContinuation::mainEvent (this=0x31e8fb0, event=1, data=0x2b02d8e13c60) at Log.cc:297
#11 0x00000000004f1f1e in Continuation::handleEvent (this=0x31e8fb0, event=1, data=0x2b02d8e13c60)
    at ../iocore/eventsystem/I_Continuation.h:146
#12 0x000000000070fb18 in EThread::execute (this=0x2b0304013010) at UnixEThread.cc:291
#13 0x000000000070e6df in spawn_thread_internal (a=0x31e83d0) at Thread.cc:88
#14 0x0000003c756077f1 in start_thread () from /lib64/libpthread.so.0
#15 0x0000003c752e570d in clone () from /lib64/libc.so.6
(gdb) f 6
#6  0x000000000060a9ab in LogBufferManager::flush_buffers (this=0x3219a08, sink=0x32199a0) at LogObject.cc:60
60            ink_release_assert(current_ref == 0);
(gdb) l
55             * m_ref_debug could be changed by other threads
56             * currently, let's store it into local variable
57             * which is helpful for debugging.
58             */
59            int current_ref = b->m_ref_debug;
60            ink_release_assert(current_ref == 0);
61            new_q.push(b);
62          }
63        }
64
(gdb) p current_ref
$2 = 2
(gdb) p b->m_references 
$3 = 0
(gdb) 
{code}

Once we found out the root cause, fix is simple. We should refine the LogBuffer::m_references variable, so that it can *really* prevent outstanding references.
                
      was (Author: yunkai):
    I have found the root cause of this issue, here is my analysis:

1. One thread(A) got an LogBuffer object in LogObject::_checkout_write() and ran into LogBuffer::checkout_write(), before it updated the LogBuffer::m_state, it was scheduled out by the kernel.

2. Other threads got the same LogBuffer object, but they *all* finished _checkout_write() codes and made the LogBuffer object become *full*, finally one of these thread flushed the LogBuffer object into write_list queue.

3. The LogBuffer object's m_references == 0 at this time.

4. The flushing thread would call LogBufferManager::flush_buffers() to loop write_list. As the LogBuffer's m_references == 0, so it would be flushed and *deleted*.

5. Now, thread(A) would be scheduled in by the kernel, but the LogBuffer object has been *deleted*, so the m_buffer == NULL, that is why entry_header is 0x0.

Here is a coredump generated by the my testing patch - "0001-A-testing-patch-to-verify-my-analysis.patch":
{code}
(gdb) bt
#0  0x0000003c75232885 in raise () from /lib64/libc.so.6
#1  0x0000003c75234065 in abort () from /lib64/libc.so.6
#2  0x00002b02d24bdaf8 in ink_die_die_die (retval=1) at ink_error.cc:43
#3  0x00002b02d24bdbca in ink_fatal_va(int, const char *, typedef __va_list_tag __va_list_tag *) (return_code=1,
    message_format=0x2b02d24dbe40 "%s:%d: failed assert `%s`", ap=0x2b0304211a90) at ink_error.cc:65
#4  0x00002b02d24bdc95 in ink_fatal (return_code=1, message_format=0x2b02d24dbe40 "%s:%d: failed assert `%s`")
    at ink_error.cc:73
#5  0x00002b02d24bcbc4 in _ink_assert (expression=0x74e021 "current_ref == 0", file=0x74e014 "LogObject.cc", line=60)
    at ink_assert.cc:38
#6  0x000000000060a9ab in LogBufferManager::flush_buffers (this=0x3219a08, sink=0x32199a0) at LogObject.cc:60
#7  0x00000000005f755e in LogObject::flush_buffers (this=0x3219970) at LogObject.h:126
#8  0x000000000060de55 in LogObjectManager::flush_buffers (this=0x31e84f0) at LogObject.cc:1185
#9  0x00000000005f6627 in Log::flush_thread_main (args=0x0) at Log.cc:1193
#10 0x00000000005f779b in LoggingFlushContinuation::mainEvent (this=0x31e8fb0, event=1, data=0x2b02d8e13c60) at Log.cc:297
#11 0x00000000004f1f1e in Continuation::handleEvent (this=0x31e8fb0, event=1, data=0x2b02d8e13c60)
    at ../iocore/eventsystem/I_Continuation.h:146
#12 0x000000000070fb18 in EThread::execute (this=0x2b0304013010) at UnixEThread.cc:291
#13 0x000000000070e6df in spawn_thread_internal (a=0x31e83d0) at Thread.cc:88
#14 0x0000003c756077f1 in start_thread () from /lib64/libpthread.so.0
#15 0x0000003c752e570d in clone () from /lib64/libc.so.6
(gdb) f 6
#6  0x000000000060a9ab in LogBufferManager::flush_buffers (this=0x3219a08, sink=0x32199a0) at LogObject.cc:60
60            ink_release_assert(current_ref == 0);
(gdb) l
55             * m_ref_debug could be changed by other threads
56             * currently, let's store it into local variable
57             * which is helpful for debugging.
58             */
59            int current_ref = b->m_ref_debug;
60            ink_release_assert(current_ref == 0);
61            new_q.push(b);
62          }
63        }
64
(gdb) p current_ref
$2 = 2
(gdb) p b->m_references 
$3 = 0
(gdb) 
{code}

Once we found out the root cause, fix is simple. We should refine the LogBuffer::m_references variaable, so that it can *really* prevent outstanding references.
                  
> Logging related segfault in 3.2.0
> ---------------------------------
>
>                 Key: TS-1330
>                 URL: https://issues.apache.org/jira/browse/TS-1330
>             Project: Traffic Server
>          Issue Type: Bug
>          Components: Logging
>    Affects Versions: 3.2.0
>         Environment: ATS 3.2.0 on RHEL 6.2 64-bit
>            Reporter: David Carlin
>            Assignee: Bryan Call
>            Priority: Critical
>              Labels: A, crash
>             Fix For: 3.5.0
>
>
> I observed the following crash once on one of our ATS boxes - possibly related to TS-1240
> Jul  2 13:56:56 l2 traffic_server[25853]: {0x2b0a391e1700} ERROR: [SSL_NetVConnection::ssl_read_from_net] SSL_ERROR_SYSCALL, underlying IO error: Connection reset by peer
> Jul  2 13:59:56 l2 kernel: [ET_NET 1][25855]: segfault at c ip 000000000058e083 sp 00002b0a2982b740 error 6
> Jul  2 13:59:56 l2 kernel: [ET_NET 3][25857]: segfault at 84 ip 000000000058e083 sp 00002b0a29a31740 error 6 in traffic_server[400000+340000]
> Jul  2 13:59:56 l2 kernel: in traffic_server[400000+340000]
> Jul  2 14:02:59 l2 traffic_cop[25901]: (test) read timeout [180000 ]
> Jul  2 14:02:59 l2 traffic_cop[25901]: server heartbeat failed [1]
> Jul  2 14:03:08 l2 traffic_manager[25826]: {0x7f3f088607e0} FATAL: [LocalManager::pollMgmtProcessServer] Error in read (errno: 104)
> Jul  2 14:03:09 l2 traffic_manager[25826]: {0x7f3f088607e0} FATAL:  (last system error 104: Connection reset by peer)
> Jul  2 14:03:09 l2 traffic_cop[25901]: cannot find traffic_server [1]
> Jul  2 14:03:09 l2 traffic_manager[25826]: {0x7f3f088607e0} ERROR: [LocalManager::sendMgmtMsgToProcesses] Error writing message
> Jul  2 14:03:09 l2 traffic_manager[25826]: {0x7f3f088607e0} ERROR:  (last system error 32: Broken pipe)
> Jul  2 14:03:12 l2 traffic_cop[25901]: cop received child status signal [25826 35584]
> Jul  2 14:03:12 l2 traffic_cop[25901]: traffic_manager not running, making sure traffic_server is dead
> Jul  2 14:03:12 l2 traffic_cop[25901]: spawning traffic_manager
> Jul  2 14:03:13 l2 traffic_manager[18267]: NOTE: --- Manager Starting ---
> Jul  2 14:03:13 l2 traffic_manager[18267]: NOTE: Manager Version: Apache Traffic Server - traffic_manager - 3.2.0 - (build # 52518 on Jun 25 2012 at 18:22:12)
> Jul  2 14:03:13 l2 traffic_manager[18267]: {0x7fe63de3f7e0} STATUS: opened /home/y/logs/trafficserver/manager.log
> Jul  2 14:03:15 l2 traffic_server[18322]: NOTE: --- Server Starting ---
> Jul  2 14:03:15 l2 traffic_server[18322]: NOTE: Server Version: Apache Traffic Server - traffic_server - 3.2.0 - (build # 52518 on Jun 25 2012 at 18:22:31)
> Jul  2 14:03:15 l2 traffic_server[18322]: {0x2b77573ab860} STATUS: opened /home/y/logs/trafficserver/diags.log
> Jul  2 14:03:15 l2 traffic_server[18322]: {0x2b77573ab860} ERROR: Cannot insert duplicate!
> Jul  2 14:03:22 l2 traffic_cop[25901]: server heartbeat succeeded
> [Jul  2 13:56:56.304] Server {0x2b0a391e1700} ERROR: [SSL_NetVConnection::ssl_read_from_net] SSL_ERROR_SYSCALL, underlying IO error: Connection reset by peer
> NOTE: Traffic Server received Sig 11: Segmentation fault
> NOTE: Traffic Server received Sig 11: Segmentation fault
> /home/y/bin/traffic_server - STACK TRACE: 
> /home/y/bin/traffic_server - STACK TRACE: 
> /lib64/libpthread.so.0[0x3b54e0f4a0]
> /lib64/libpthread.so.0[0x3b54e0f4a0]
> /home/y/bin/traffic_server(_ZN9LogBuffer14checkout_writeEPmm+0x153)[0x58e083]
> /home/y/bin/traffic_server(_ZN9LogObject15_checkout_writeEPmm+0xa8)[0x5a64c8]
> /home/y/bin/traffic_server(_ZN9LogObject3logEP9LogAccessPc+0x2f0)[0x5a7e30]
> /home/y/bin/traffic_server(_ZN9LogBuffer14checkout_writeEPmm+0x153)[0x58e083]
> /home/y/bin/traffic_server(_ZN9LogObject15_checkout_writeEPmm+0xa8)[0x5a64c8]
> /home/y/bin/traffic_server(_ZN9LogObject3logEP9LogAccessPc+0x2f0)[0x5a7e30]
> /home/y/bin/traffic_server(_ZN3Log6accessEP9LogAccess+0x146)[0x58f506]
> /home/y/bin/traffic_server(_ZN6HttpSM12update_statsEv+0x630)[0x526c50]
> /home/y/bin/traffic_server(_ZN3Log6accessEP9LogAccess+0x146)[0x58f506]
> /home/y/bin/traffic_server(_ZN6HttpSM9kill_thisEv+0x928)[0x52b548]
> /home/y/bin/traffic_server(_ZN6HttpSM12update_statsEv+0x630)[0x526c50]
> /home/y/bin/traffic_server(_ZN6HttpSM9kill_thisEv+0x928)[0x52b548]
> /home/y/bin/traffic_server(_ZN6HttpSM12main_handlerEiPv+0x198)[0x52b868]
> /home/y/bin/traffic_server(_ZN10HttpTunnel12main_handlerEiPv+0xde)[0x56c3ee]
> /home/y/bin/traffic_server[0x6736a1]
> /home/y/bin/traffic_server(_ZN6HttpSM12main_handlerEiPv+0x198)[0x52b868]
> /home/y/bin/traffic_server(_Z15write_to_net_ioP10NetHandlerP18UnixNetVConnectionP7EThread+0x847)[0x675517]
> /home/y/bin/traffic_server[0x672f81]
> /home/y/bin/traffic_server(_ZN10NetHandler12mainNetEventEiP5Event+0x286)[0x66df96]
> /home/y/bin/traffic_server[0x676ad5]
> /home/y/bin/traffic_server(_ZN7EThread13process_eventEP5Eventi+0xb4)[0x696b14]
> /home/y/bin/traffic_server(_ZN10NetHandler12mainNetEventEiP5Event+0x1f2)[0x66df02]
> /home/y/bin/traffic_server(_ZN7EThread13process_eventEP5Eventi+0xb4)[0x696b14]
> /home/y/bin/traffic_server(_ZN7EThread7executeEv+0x4c3)[0x6974a3]
> /home/y/bin/traffic_server[0x695ae2]
> /lib64/libpthread.so.0[0x3b54e077f1]
> /lib64/libc.so.6(clone+0x6d)[0x3b54ae5ccd]
> /home/y/bin/traffic_server(_ZN7EThread7executeEv+0x4c3)[0x6974a3]
> /home/y/bin/traffic_server[0x695ae2]
> /lib64/libpthread.so.0[0x3b54e077f1]
> /lib64/libc.so.6(clone+0x6d)[0x3b54ae5ccd]
> [Jul  2 14:03:08.684] Manager {0x7f3f088607e0} FATAL: [LocalManager::pollMgmtProcessServer] Error in read (errno: 104)
> [Jul  2 14:03:09.213] Manager {0x7f3f088607e0} FATAL:  (last system error 104: Connection reset by peer)
> [Jul  2 14:03:09.213] Manager {0x7f3f088607e0} NOTE: [LocalManager::mgmtShutdown] Executing shutdown request.
> [Jul  2 14:03:09.213] Manager {0x7f3f088607e0} NOTE: [LocalManager::processShutdown] Executing process shutdown request.
> [Jul  2 14:03:09.496] Manager {0x7f3f088607e0} ERROR: [LocalManager::sendMgmtMsgToProcesses] Error writing message
> [Jul  2 14:03:09.496] Manager {0x7f3f088607e0} ERROR:  (last system error 32: Broken pipe)
> [example_alarm_bin.sh] sent alarm: l2.ycs.uls.yahoo.com [Mon Jul  2 14:03:09 2012] The TS-TM connection is broken for some reason. Either restart TS and TM or correct this error for TM to display TS statistics correctly

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira