You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@trafficserver.apache.org by "Chou, Peter" <pb...@labs.att.com> on 2016/06/24 00:59:33 UTC
Please review fix for TS-4475 Log-Collation core-dump.

Hi all,

I am working on a fix for TS-4475 which is a core dump in LogCollationClientSM.cc. I gather that there is not a lot of interest in the Log Collation client so I am reaching out to the mailing list especially for folks familiar with inactivity cop.

Background: The core dump occurs when a VC_INACTIVITY_TIMEOUT_EVENT == 105 is sent to various of the state machine's event handlers This event is not coded into the switch(event) so it core dumps as follows due to the following code -

  default:
    ink_assert(!"unexpcted state");
    return EVENT_CONT;

We have been seeing this inactivity event occur recently since we changed the net inactivity default timer from 86400 to 300s (in order to use inactivity_cop to help get rid of hanging connections during an over-loaded traffic condition). We are using ATS as a log collation client, i.e., we have defined log collation hosts in logs_xml.config.

I have inserted the following code into the various event handlers to "ignore" the time-out event -

+  case VC_EVENT_INACTIVITY_TIMEOUT:
+    Note("[LogCollationClientSM] - ignoring VC_EVENT_INACTIVITY_TIMEOUT");
+    return EVENT_CONT;
  case VC_EVENT_EOS:

This seems to work (generating a NOTE in diags.log every five minutes), but I am concerned whether NetHandler::manage_active_queue() would be trying to close the VC even if I ignore the time-out event in the event handler?

Example from diags.log -

[Jun 24 00:39:35.031] Server {0x2b387ec7c700} NOTE: [LogCollationClientSM] - ignoring VC_EVENT_INACTIVITY_TIMEOUT

Debug from traffic.out -

[Jun 24 00:39:35.031] Server {0x2b387ec7c700} DEBUG: (inactivity_cop_verbose) vc: 0x2b38c0018d00 now: 1466728775031910904 timeout at
: 1466728774 timeout in: 300
[Jun 24 00:39:35.031] Server {0x2b387ec7c700} DEBUG: (inactivity_cop_verbose) vc: 0x2b38c0018d00 now: 1466728775031910904 timeout at
: 1466728774291725549 timeout in: 300000000000
[Jun 24 00:39:37.024] Server {0x2b387ec7c700} DEBUG: (inactivity_cop_verbose) vc: 0x2b38c0018d00 now: 1466728777024229232 timeout at
: 1466729076 timeout in: 300

It seems like the net effect is just to advance the time-out another ~ 300s upon each time-out. Note that the debug message numbers seem to change units at the transition point.

Appreciate any recommendations or comments on this solution.

Thanks,
Peter