Posted to issues@htrace.apache.org by "Colin Patrick McCabe (JIRA)" <ji...@apache.org> on 2015/06/29 22:06:05 UTC

[jira] [Commented] (HTRACE-200) Reduce rate of logged errors if Zipkin Collector service is down

    [ https://issues.apache.org/jira/browse/HTRACE-200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14606287#comment-14606287 ] 

Colin Patrick McCabe commented on HTRACE-200:
---------------------------------------------

I think it makes sense to rate-limit these error log messages by logging at most one every 60 seconds (or some configurable interval).  We do something similar in {{HTraceRESTReceiver.java}}.
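
A minimal sketch of that kind of interval-based rate limiting follows. It is illustrative only: {{RateLimitedErrorLog}}, {{shouldLog}} and {{drainSuppressed}} are hypothetical names, not HTrace classes or methods, and this is not the code in {{HTraceRESTReceiver.java}}. The clock is passed in as an assumption so the behaviour can be exercised without waiting.

```java
import java.util.function.LongSupplier;

/**
 * Hypothetical helper (not part of HTrace): allows at most one error
 * log message per interval and counts the ones it suppresses.
 */
final class RateLimitedErrorLog {
  private final long intervalMs;
  private final LongSupplier clockMs; // assumed millisecond clock, injected for testing
  private boolean loggedOnce = false;
  private long lastLoggedMs = 0L;
  private long suppressed = 0L;

  RateLimitedErrorLog(long intervalMs, LongSupplier clockMs) {
    this.intervalMs = intervalMs;
    this.clockMs = clockMs;
  }

  /** Returns true if the caller should write this error to the log now. */
  synchronized boolean shouldLog() {
    long now = clockMs.getAsLong();
    // The first error is always logged; later ones only after the interval has passed.
    if (!loggedOnce || now - lastLoggedMs >= intervalMs) {
      loggedOnce = true;
      lastLoggedMs = now;
      return true;
    }
    suppressed++;
    return false;
  }

  /** Returns how many errors were suppressed since the last call, and resets the count. */
  synchronized long drainSuppressed() {
    long n = suppressed;
    suppressed = 0L;
    return n;
  }
}
```

A receiver could build one of these with a 60000 ms interval and a monotonic clock such as {{() -> System.nanoTime() / 1_000_000L}}, then call {{shouldLog()}} in its send-failure path before logging, optionally including the {{drainSuppressed()}} count in the message so skipped errors remain visible.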

> Reduce rate of logged errors if Zipkin Collector service is down
> ----------------------------------------------------------------
>
>                 Key: HTRACE-200
>                 URL: https://issues.apache.org/jira/browse/HTRACE-200
>             Project: HTrace
>          Issue Type: Improvement
>            Reporter: Andrew Olson
>            Priority: Minor
>
> We see a flood of errors logged by the ZipkinSpanReceiver when our Zipkin Collector service is not running - about one error every second or two, by each of our processes that are instrumented with HTrace and configured to send traces to Zipkin. Exacerbating the problem for us, it seems that with commons-logging, every line of the exception stack trace includes a prefix like "2015-06-29 09:03:25 zipkinSpanReceiver-0 STDIO [ERROR]", so that Splunk parses it as a separate error message. Here [1] is an example log file. It would be nice if this error logging could be rate-limited to something like no more than one per minute, or possibly only the initial occurrence logged until a successful send occurs to reset the state.
> [1] http://pastebin.com/AieewfhF



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)