You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@trafficserver.apache.org by "Leif Hedstrom (JIRA)" <ji...@apache.org> on 2013/08/28 16:26:53 UTC

[jira] [Comment Edited] (TS-2145) Add metrics for log collation, e.g. messages sent and messages received

    [ https://issues.apache.org/jira/browse/TS-2145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13752429#comment-13752429 ] 

Leif Hedstrom edited comment on TS-2145 at 8/28/13 2:26 PM:
------------------------------------------------------------

Nice! Couple of minor thought:

1) Why not add an additional metric for "disk full" events? E.g. in
{code}
+    case Log::FAIL:
+    case Log::FULL:
{code}
would it not make sense to treat "failures" as different events than disk full? I can see taking different actions on e.g. LOG::FULL. Unless of course the only "failure" is disk full, but then we really ought to clean up that :).


2) I'm not a huge fan of the existing #define's like 
{code}
#define LOG_INCREMENT_DYN_STAT(x) \
        RecIncrRawStat(log_rsb, mutex->thread_holding, (int) x, 1);
{code}

Since you are working in this code already, could we just eliminate e.g. LOG_INCREMENT_DYN_STAT (and the other similar ones) and just use RecIncrRawStat() directly? The issue I have is that these defines depend on certain things in the current scope, such as the mutex, making the code less readable IMO.

3) Use < 120 characters per line, e.g. this can easily fit on one line, and not two:
{code}
+    Note("logging space exhausted, can't write to:%s, drop this entry",
+         m_logFile->m_name);
{code}


4) In ::log() why use an if statement and not a switch ? E.g. why
{code}
+  if (ret & (Log::FAIL | Log::FULL)) {
+    // At least one of object logging failed.
+    LOG_INCREMENT_DYN_STAT(log_stat_event_log_access_fail_stat);
+  } else if (ret & Log::LOG_OK) {
{code}
Considering that LOG_OK is probably the predominant case, why sacrifice and do two logical operations in that case? This seems unnecessary, and inconsistent with how you use the return values elsewhere. Also, assuming the return value is only one of ReturnCodeFlags, why did you make that a bit-field instead of just a sequence? If you switch the above to a switch statement, maybe change ReturnCodeFlags back as well.

5) This is very nitpicky, but I find the naming of corresponding stats slightly confusing. We have log_stat_event_log_error_skip_stat and log_stat_event_log_access_skip_stat. The latter presumably is for log collation receiver side of log collation? Would it make sense to call them e.g. log_stat_event_log_produce_skip_stat and log_stat_event_log_consume_skip_stat ? Or perhaps _write_ and _read_ (respectively) ?
                
      was (Author: zwoop):
    Nice! Couple of minor thought:

1) Why not add an additional metric for "disk full" events? E.g. in
{code}
+    case Log::FAIL:
+    case Log::FULL:
{code]
would it not make sense to treat "failures" as different events than disk full? I can see taking different actions on e.g. LOG::FULL. Unless of course the only "failure" is disk full, but then we really ought to clean up that :).


2) I'm not a huge fan of the existing #define's like 
{code}
#define LOG_INCREMENT_DYN_STAT(x) \
        RecIncrRawStat(log_rsb, mutex->thread_holding, (int) x, 1);
{code}

Since you are working in this code already, could we just eliminate e.g. LOG_INCREMENT_DYN_STAT (and the other similar ones) and just use RecIncrRawStat() directly? The issue I have is that these defines depend on certain things in the current scope, such as the mutex, making the code less readable IMO.

3) Use < 120 characters per line, e.g. this can easily fit on one line, and not two:
{code}
+    Note("logging space exhausted, can't write to:%s, drop this entry",
+         m_logFile->m_name);
{code}


4) In ::log() why use an if statement and not a switch ? E.g. why
{code}
+  if (ret & (Log::FAIL | Log::FULL)) {
+    // At least one of object logging failed.
+    LOG_INCREMENT_DYN_STAT(log_stat_event_log_access_fail_stat);
+  } else if (ret & Log::LOG_OK) {
{code}
Considering that LOG_OK is probably the predominant case, why sacrifice and do two logical operations in that case? This seems unnecessary, and inconsistent with how you use the return values elsewhere. Also, assuming the return value is only one of ReturnCodeFlags, why did you make that a bit-field instead of just a sequence? If you switch the above to a switch statement, maybe change ReturnCodeFlags back as well.

5) This is very nitpicky, but I find the naming of corresponding stats slightly confusing. We have log_stat_event_log_error_skip_stat and log_stat_event_log_access_skip_stat. The latter presumably is for log collation receiver side of log collation? Would it make sense to call them e.g. log_stat_event_log_produce_skip_stat and log_stat_event_log_consume_skip_stat ? Or perhaps _write_ and _read_ (respectively) ?
                  
> Add metrics for log collation, e.g. messages sent and messages received
> -----------------------------------------------------------------------
>
>                 Key: TS-2145
>                 URL: https://issues.apache.org/jira/browse/TS-2145
>             Project: Traffic Server
>          Issue Type: Improvement
>          Components: Logging
>            Reporter: Yunkai Zhang
>            Assignee: Yunkai Zhang
>         Attachments: 0001-TS-2145-Add-metrics-for-log-collation.patch
>
>
> Without accurate stats, we don't known whether the log is lost.
> This coming patch will be used to verify:
> 1) At log client side, how many logs genterated from HttpSM, how many logs sent to LogServer.
> 2) At log server side, how many logs received from LogClient, how many logs written to LogServers's disk.
> ==
> I have attached a patch. Let's show an example for the metrics collected by this patch:
> In this example, there are two machines: one log-client and one log-server.
> {code}
> 1) Clear old stats data in both client and server machines:
> $ /etc/init.d/trafficserver stop
> $ rm -rf /var/run/trafficserver/*.snap
> 2) Make logs_xml.config in log server empty.
> 3) Make logs_xml.config in log client looks like:
> <LogObject>
>   <Format = "Cdn_access_log"/>
>   <CollationHosts = "<log-server-ip>:8085"/>
>   <Filename = "cdn_access"/>
>   <Protocols = "http"/>
>   <Filters = "only_get"/>
> </LogObject>
> 2) Start log server (with collation mode == 1, logging_enable = 3, squid_log_enabled = 0)
> [log-server]$ /etc/init.d/trafficserver start
> 3) Start log client (with collation mode = 0, logging_enable = 3, squid_log_enabled = 0)
> [log-server]$ /etc/init.d/trafficserver start
> 4) Use jtest do some request for log client.
> 5) Stop log client and for a while (so that all data in network have been sent to log-server).
> 6) Log client's metrics for logging:
> [log-client]# lynx -dump http://localhost:8080/stat/ | grep process.log
> proxy.process.log.event_log_error=0
> proxy.process.log.event_log_error_skip=0
> proxy.process.log.event_log_error_aggr=0
> proxy.process.log.event_log_error_fail=0
> proxy.process.log.event_log_access=4990275
> proxy.process.log.event_log_access_skip=0
> proxy.process.log.event_log_access_aggr=0
> proxy.process.log.event_log_access_fail=0
> proxy.process.log.num_sent_to_network=4990275
> proxy.process.log.num_received_from_network=0
> proxy.process.log.num_flush_to_disk=0
> proxy.process.log.bytes_sent_to_network=719149632
> proxy.process.log.bytes_received_from_network=0
> proxy.process.log.bytes_flush_to_disk=0
> proxy.process.log.bytes_written_to_disk=0
> proxy.process.log.log_files_open=0
> proxy.process.log.log_files_space_used=5330
> 7) Log server's metrics for logging:
> [log-server]# lynx -dump http://localhost:8080/stat/ | grep process.log
> proxy.process.log.event_log_error=0
> proxy.process.log.event_log_error_skip=0
> proxy.process.log.event_log_error_aggr=0
> proxy.process.log.event_log_error_fail=0
> proxy.process.log.event_log_access=0
> proxy.process.log.event_log_access_skip=20
> proxy.process.log.event_log_access_aggr=0
> proxy.process.log.event_log_access_fail=0
> proxy.process.log.num_sent_to_network=0
> proxy.process.log.num_received_from_network=4990275
> proxy.process.log.num_flush_to_disk=4990275
> proxy.process.log.bytes_sent_to_network=0
> proxy.process.log.bytes_received_from_network=719149632
> proxy.process.log.bytes_flush_to_disk=485256906
> proxy.process.log.bytes_written_to_disk=485256906
> proxy.process.log.log_files_open=1
> proxy.process.log.log_files_space_used=85215225070
> {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira