Posted to dev@kafka.apache.org by "Matthias J. Sax" <ma...@confluent.io> on 2018/04/02 21:03:25 UTC

Re: [DISCUSS] KIP-274: Kafka Streams Skipped Records Metrics

John,

sorry for my late reply and thanks for updating the KIP.

I like your approach of "metrics are for monitoring, logs are for
debugging" -- however:

1) I don't see a connection between this and the task-level metrics that
you propose in order to get the metrics into `TopologyTestDriver`. I
don't think people would monitor the `TopologyTestDriver`, and thus I am
wondering why it is important to include the metrics there. Thread-level
metrics might be easier to monitor though (i.e., fewer distinct metrics
to monitor).

2) I am a little worried that WARN level logging might be too chatty --
as you pointed out, it's about debugging, thus DEBUG level might be
better. Not 100% sure about this to be honest. What is the general
assumption about the frequency of skipped records? I could imagine cases
in which skipped records are quite frequent and thus WARN level logs
might "flood" the logs.

One final remark:

> More
> generally, I would like to establish a pattern in which we could add new
> values for the "reason" tags without needing a KIP to do so.

From my understanding, this is not feasible. Changing metrics is always
considered a public API change, and we need a KIP for any change. As we
moved away from tagging, it doesn't matter for the KIP anymore -- just
wanted to point it out.


-Matthias
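[Editor's sketch: the "flooding" concern above is about log volume, not correctness. One mitigation sometimes used (not proposed in the KIP; the class and method names here are invented for illustration) is to rate-limit the WARN line while still counting every skip, so the metric stays exact:]

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical helper (not Kafka code): log at most one WARN per interval,
// while still counting every skip so the metric stays exact.
public class ThrottledSkipLogger {
    private final long intervalMs;
    private final AtomicLong skips = new AtomicLong();
    private long lastLogMs;

    public ThrottledSkipLogger(long intervalMs) {
        this.intervalMs = intervalMs;
        this.lastLogMs = -intervalMs; // so the very first skip is logged
    }

    /** Records a skip; returns true if this one should be logged at WARN. */
    public synchronized boolean shouldLog(long nowMs) {
        skips.incrementAndGet();
        if (nowMs - lastLogMs >= intervalMs) {
            lastLogMs = nowMs;
            return true;
        }
        return false;
    }

    public long totalSkips() { return skips.get(); }
}
```

With this shape, a burst of skips produces one WARN per interval, but the counter behind the metric still reflects every record.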


On 3/30/18 2:47 PM, John Roesler wrote:
> Allrighty! The KIP is updated.
> 
> Thanks again, all, for the feedback.
> -John
> 
> On Fri, Mar 30, 2018 at 3:35 PM, John Roesler <jo...@confluent.io> wrote:
> 
>> Hey Guozhang and Bill,
>>
>> Ok, I'll update the KIP. At the risk of disturbing consensus, I'd like to
>> put it in the task instead of the thread so that it'll show up in the
>> TopologyTestDriver metrics as well.
>>
>> I'm leaning toward keeping the scope where it is right now, but if others
>> want to advocate for tossing in some more metrics, we can go that route.
>>
>> Thanks all,
>> -John
>>
>> On Fri, Mar 30, 2018 at 2:37 PM, Bill Bejeck <bb...@gmail.com> wrote:
>>
>>> Thanks for the KIP John, and sorry for the late comments.
>>>
>>> I'm on the fence with providing a single level metrics, but I think we'll
>>> have that discussion outside of this KIP.
>>>
>>>> * maintain one skipped-record metric (could be per-thread, per-task, or
>>>> per-processor-node) with no "reason"
>>>> * introduce a warn-level log detailing the topic/partition/offset and
>>>> reason of the skipped record
>>>
>>> I'm +1 on both of these suggestions.
>>>
>>> Finally, we have had requests in the past for some metrics around when
>>> persistent store removes an expired window.  Would adding that to our
>>> metrics stretch the scope of this KIP too much?
>>>
>>> Thanks again and overall I'm +1 on this KIP
>>>
>>> Bill
>>>
>>> On Fri, Mar 30, 2018 at 2:00 PM, Guozhang Wang <wa...@gmail.com>
>>> wrote:
>>>
>>>> The proposal sounds good to me. About "maintain only one level of
>>> metrics"
>>>> maybe we can discuss about that separately from this KIP since that
>>> would
>>>> be a larger scope of discussion. I agree that if we are going to
>>> maintain
>>>> only one-level metrics it should be lowest level and we would let users
>>> to
>>>> do the roll-ups themselves, but I'm still not fully convinced that we
>>>> should just provide single-level metrics, because 1) I think for
>>> different
>>>> metrics people may be interested to investigate into different
>>>> granularities, e.g. for poll / commit rate these are at the lowest
>>>> task-level metrics, while for process-rate / skip-rate they can be as
>>> low
>>>> as processor-node metrics, and 2) user-side rolling ups may not be very
>>>> straight-forward. But for 2) if someone can provide an efficient and
>>> easy
>>>> implementation of that I can be persuaded :)
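[Editor's sketch: the "user-side roll-up" in question, assuming (hypothetically) that the fine-grained skip counts are available as a map keyed by "thread|task" -- the real Kafka `Metrics` API is keyed differently, so this only illustrates the aggregation step:]

```java
import java.util.HashMap;
import java.util.Map;

public class SkipRollup {
    // Hypothetical fine-grained values, e.g. "thread-1|0_1" -> skip count.
    // Roll up to the per-thread totals a coarse metric would report.
    public static Map<String, Long> perThread(Map<String, Long> perTask) {
        Map<String, Long> totals = new HashMap<>();
        for (Map.Entry<String, Long> e : perTask.entrySet()) {
            String thread = e.getKey().split("\\|")[0];
            totals.merge(thread, e.getValue(), Long::sum);
        }
        return totals;
    }
}
```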
>>>>
>>>> For now I'm thinking we can add the metric on thread-level, either with
>>>> finer grained ones with "reason" tag plus an aggregated one without the
>>>> tag, or just having a single aggregated metric without the tag looks
>>> good
>>>> to me.
>>>>
>>>>
>>>> Guozhang
>>>>
>>>> On Fri, Mar 30, 2018 at 8:05 AM, John Roesler <jo...@confluent.io>
>>> wrote:
>>>>
>>>>> Hey Guozhang,
>>>>>
>>>>> Thanks for the reply. Regarding JMX, I can dig it. I'll provide a
>>> list in
>>>>> the KIP. I was also thinking we'd better start a documentation page
>>> with
>>>>> the metrics listed.
>>>>>
>>>>> I'd have no problem logging a warning when we skip records. On the
>>> metric
>>>>> front, really I'm just pushing for us to maintain only one level of
>>>>> metrics. If that's more or less granular (i.e., maybe we don't have a
>>>>> metric per reason and log the reason instead), that's fine by me. I
>>> just
>>>>> don't think it provides a lot of extra value per complexity (interface
>>>> and
>>>>> implementation) to maintain roll-ups at the thread level in addition
>>> to
>>>>> lower-level metrics.
>>>>>
>>>>> How about this instead:
>>>>> * maintain one skipped-record metric (could be per-thread, per-task,
>>> or
>>>>> per-processor-node) with no "reason"
>>>>> * introduce a warn-level log detailing the topic/partition/offset and
>>>>> reason of the skipped record
>>>>>
>>>>> If you like that, I can update the KIP.
>>>>>
>>>>> Thanks,
>>>>> -John
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Mar 29, 2018 at 6:22 PM, Guozhang Wang <wa...@gmail.com>
>>>> wrote:
>>>>>
>>>>>>> One thing you mention is the notion of setting alerts on coarser
>>>>> metrics
>>>>>> being easier than finer ones. All the metric alerting systems I have
>>>> used
>>>>>> make it equally easy to alert on metrics by-tag or over tags. So my
>>>>>> experience doesn't say that this is a use case. Were you thinking
>>> of an
>>>>>> alerting system that makes such a pre-aggregation valuable?
>>>>>>
>>>>>> For the commonly used JMX reporter tags will be encoded directly as
>>>> part
>>>>> of
>>>>>> the object name, and if users wants to monitor them they need to
>>> know
>>>>> these
>>>>>> values before hand. That is also why I think we do want to list all
>>> the
>>>>>> possible values of the reason tags in the KIP, since
>>>>>>
>>>>>>> In my email in response to Matthias, I gave an example of the
>>> kind of
>>>>>> scenario that would lead me as an operator to run with DEBUG on all
>>> the
>>>>>> time, since I wouldn't be sure, having seen a skipped record once,
>>> that
>>>>> it
>>>>>> would ever happen again. The solution is to capture all the
>>> available
>>>>>> information about the reason and location of skips all the time.
>>>>>>
>>>>>> That is a good point. I think we can either expose all levels
>>> metrics
>>>> as
>>>>> by
>>>>>> default, or only expose the most lower-level metrics and get rid of
>>>> other
>>>>>> levels to let users do roll-ups themselves (which will be a much
>>> larger
>>>>>> scope for discussion), or we can encourage users to not purely
>>> depend
>>>> on
>>>>>> metrics for such trouble shooting: that is to say, users only be
>>>> alerted
>>>>>> based on metrics, and we can log a info / warn log4j entry each
>>> time we
>>>>> are
>>>>>> about to skip a record all over the places, so that upon being
>>> notified
>>>>>> users can look into the logs to find the details on where / when it
>>>>>> happens. WDYT?
>>>>>>
>>>>>>
>>>>>> Guozhang
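[Editor's sketch: Guozhang's point about the JMX reporter can be seen with the stock `javax.management.ObjectName` API -- tags become key properties of the object name, so a monitoring rule pinned to one `reason` value silently misses any value added later. The metric and tag names below are illustrative, not the final KIP names:]

```java
import javax.management.ObjectName;

public class JmxTagMatching {
    // True if the ObjectName pattern matches the concrete name.
    public static boolean matches(String pattern, String name) throws Exception {
        return new ObjectName(pattern).apply(new ObjectName(name));
    }

    public static void main(String[] args) throws Exception {
        // Two hypothetical skip metrics differing only in the "reason" tag.
        String deser = "kafka.streams:type=stream-task-metrics,task-id=0_1,reason=deserialization-error";
        String negTs = "kafka.streams:type=stream-task-metrics,task-id=0_1,reason=negative-timestamp";

        // A wildcard over the property list aggregates across unknown reasons...
        System.out.println(matches("kafka.streams:type=stream-task-metrics,*", deser)); // true
        System.out.println(matches("kafka.streams:type=stream-task-metrics,*", negTs)); // true

        // ...but a rule written for one known reason misses the others, which
        // is why the possible tag values would need documenting up front.
        System.out.println(matches("kafka.streams:reason=deserialization-error,*", negTs)); // false
    }
}
```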
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, Mar 29, 2018 at 3:57 PM, John Roesler <jo...@confluent.io>
>>>> wrote:
>>>>>>
>>>>>>> Hey Guozhang,
>>>>>>>
>>>>>>> Thanks for the review.
>>>>>>>
>>>>>>> 1.
>>>>>>> Matthias raised the same question about the "reason" tag values. I
>>>> can
>>>>>> list
>>>>>>> all possible values of the "reason" tag, but I'm thinking this
>>> level
>>>> of
>>>>>>> detail may not be KIP-worthy, maybe the code and documentation
>>> review
>>>>>> would
>>>>>>> be sufficient. If you all disagree and would like it included in
>>> the
>>>>>> KIP, I
>>>>>>> can certainly do that.
>>>>>>>
>>>>>>> If we do provide roll-up metrics, I agree with the pattern of
>>> keeping
>>>>> the
>>>>>>> same name but eliminating the tags for the dimensions that were
>>>>>> rolled-up.
>>>>>>>
>>>>>>> 2.
>>>>>>> I'm not too sure that implementation efficiency really becomes a
>>>> factor
>>>>>> in
>>>>>>> choosing whether to (by default) update one coarse metric at the
>>>> thread
>>>>>>> level or one granular metric at the processor-node level, since
>>> it's
>>>>> just
>>>>>>> one metric being updated either way. I do agree that if we were to
>>>>> update
>>>>>>> the granular metrics and multiple roll-ups, then we should
>>> consider
>>>> the
>>>>>>> efficiency.
>>>>>>>
>>>>>>> I agree it's probably not necessary to surface the metrics for all
>>>>> nodes
>>>>>>> regardless of whether they can or do skip records. Perhaps we can
>>>>> lazily
>>>>>>> register the metrics.
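[Editor's sketch: the lazy registration John mentions is typically a computeIfAbsent-style pattern -- a metric only comes into existence on a node's first skip. The sensor type here is invented, not the Kafka `Metrics` API:]

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

public class LazySkipMetrics {
    private final Map<String, AtomicLong> sensors = new ConcurrentHashMap<>();

    // The sensor for a node is registered on its first skip, so nodes that
    // can never skip (or simply never do) don't clutter the metrics namespace.
    public void recordSkip(String nodeName) {
        sensors.computeIfAbsent(nodeName, n -> new AtomicLong()).incrementAndGet();
    }

    public Map<String, AtomicLong> registered() { return sensors; }
}
```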
>>>>>>>
>>>>>>> In my email in response to Matthias, I gave an example of the
>>> kind of
>>>>>>> scenario that would lead me as an operator to run with DEBUG on
>>> all
>>>> the
>>>>>>> time, since I wouldn't be sure, having seen a skipped record once,
>>>> that
>>>>>> it
>>>>>>> would ever happen again. The solution is to capture all the
>>> available
>>>>>>> information about the reason and location of skips all the time.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> One thing you mention is the notion of setting alerts on coarser
>>>>> metrics
>>>>>>> being easier than finer ones. All the metric alerting systems I
>>> have
>>>>> used
>>>>>>> make it equally easy to alert on metrics by-tag or over tags. So
>>> my
>>>>>>> experience doesn't say that this is a use case. Were you thinking
>>> of
>>>> an
>>>>>>> alerting system that makes such a pre-aggregation valuable?
>>>>>>>
>>>>>>> Thanks again,
>>>>>>> -John
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Mar 29, 2018 at 5:24 PM, Guozhang Wang <
>>> wangguoz@gmail.com>
>>>>>> wrote:
>>>>>>>
>>>>>>>> Hello John,
>>>>>>>>
>>>>>>>> Thanks for the KIP. Some comments:
>>>>>>>>
>>>>>>>> 1. Could you list all the possible values of the "reason" tag?
>>> In
>>>> the
>>>>>>> JIRA
>>>>>>>> ticket I left some potential reasons but I'm not clear if you're
>>>>> going
>>>>>> to
>>>>>>>> categorize each of them as a separate reason, or is there any
>>>>>> additional
>>>>>>>> ones you have in mind.
>>>>>>>>
>>>>>>>> Also I'm wondering if we should add another metric that do not
>>> have
>>>>> the
>>>>>>>> reason tag but aggregates among all possible reasons? This is
>>> for
>>>>> users
>>>>>>> to
>>>>>>>> easily set their alerting notifications (otherwise they have to
>>>> write
>>>>>> on
>>>>>>>> notification rule per reason) in their monitoring systems.
>>>>>>>>
>>>>>>>> 2. Note that the processor-node metrics is actually "per-thread,
>>>>>>> per-task,
>>>>>>>> per-processor-node", and today we only set the per-thread
>>> metrics
>>>> as
>>>>>> INFO
>>>>>>>> while leaving the lower two layers as DEBUG. I agree with your
>>>>> argument
>>>>>>>> that we are missing the per-client roll-up metrics today, but
>>> I'm
>>>>>>> convinced
>>>>>>>> that the right way to approach it would be
>>>>> "just-providing-the-lowest-
>>>>>>>> level
>>>>>>>> metrics only".
>>>>>>>>
>>>>>>>> Note the recoding implementation of these three levels are
>>>> different
>>>>>>>> internally today: we did not just do the rolling up to generate
>>> the
>>>>>>>> higher-level metrics from the lower level ones, but we just
>>> record
>>>>> them
>>>>>>>> separately, which means that, if we turn on multiple levels of
>>>>> metrics,
>>>>>>> we
>>>>>>>> maybe duplicate collecting some metrics. One can argue that is
>>> not
>>>>> the
>>>>>>> best
>>>>>>>> way to represent multi-level metrics collecting and reporting,
>>> but
>>>> by
>>>>>>> only
>>>>>>>> enabling thread-level metrics as INFO today, that implementation
>>>>> could
>>>>>> be
>>>>>>>> more efficient than only collecting the metrics at the lowest
>>>> level,
>>>>>> and
>>>>>>>> then do the roll-up calculations outside of the metrics classes.
>>>>>>>>
>>>>>>>> Plus, today not all processor-nodes may possibly skip records,
>>>> AFAIK
>>>>> we
>>>>>>>> will only skip records at the source, sink, window and
>>> aggregation
>>>>>>>> processor nodes, so adding a metric per processor looks like an
>>>>>> overkill
>>>>>>> to
>>>>>>>> me as well. On the other hand, from user's perspective the
>>> "reason"
>>>>> tag
>>>>>>> may
>>>>>>>> be sufficient for them to narrow down where inside the topology
>>> is
>>>>>>> causing
>>>>>>>> records to be dropped on the floor. So I think the "per-thread,
>>>>>> per-task"
>>>>>>>> level metrics should be sufficient for them in trouble shoot in
>>>> DEBUG
>>>>>>> mode,
>>>>>>>> and we can add another "per-thread" level metrics as INFO which
>>> is
>>>>>> turned
>>>>>>>> on by default. So under normal execution users still only need
>>> INFO
>>>>>> level
>>>>>>>> metrics for alerting (e.g. set alerts on all skipped-records
>>>> metrics
>>>>> as
>>>>>>>> non-zero), and then upon trouble shooting they can turn on DEBUG
>>>>>> metrics
>>>>>>> to
>>>>>>>> look into which task is actually causing the skipped records.
>>>>>>>>
>>>>>>>>
>>>>>>>> Guozhang
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Mar 29, 2018 at 2:03 PM, Matthias J. Sax <
>>>>>> matthias@confluent.io>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Thanks for the KIP John.
>>>>>>>>>
>>>>>>>>> Reading the material on the related Jiras, I am wondering what
>>>>>> `reason`
>>>>>>>>> tags you want to introduce? Can you elaborate? The KIP should
>>>> list
>>>>>>> those
>>>>>>>>> IMHO.
>>>>>>>>>
>>>>>>>>> About the fine grained metrics vs the roll-up: you say that
>>>>>>>>>
>>>>>>>>>> the coarse metric aggregates across two dimensions
>>>> simultaneously
>>>>>>>>>
>>>>>>>>> Can you elaborate why this is an issue? I am not convinced atm
>>>> that
>>>>>> we
>>>>>>>>> should put the fine grained metrics into INFO level and remove
>>>> the
>>>>>>>>> roll-up at thread level.
>>>>>>>>>
>>>>>>>>>> Given that they have to do this sum to get a usable
>>> top-level
>>>>> view
>>>>>>>>>
>>>>>>>>> This is a fair concern, but I don't share the conclusion.
>>>> Offering
>>>>> a
>>>>>>>>> built-in `KafkaStreams` "client" roll-up out of the box might
>>> be
>>>> a
>>>>>>>>> better solution. In the past we did not offer this due to
>>>>> performance
>>>>>>>>> concerns, but we could allow an "opt-in" mechanism. If you
>>>>> disagree,
>>>>>>> can
>>>>>>>>> you provide some reasoning and add them to the "Rejected
>>>>>> alternatives"
>>>>>>>>> section.
>>>>>>>>>
>>>>>>>>> To rephrase: I understand the issue about missing top-level
>>> view,
>>>>> but
>>>>>>>>> instead of going more fine grained, we should consider to add
>>>> this
>>>>>>>>> top-level view and add/keep the fine grained metrics at DEBUG
>>>> level
>>>>>>> only
>>>>>>>>>
>>>>>>>>> I am +1 to add TopologyTestDriver#metrics() and to remove old
>>>>> metrics
>>>>>>>>> directly as you suggested.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> -Matthias
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 3/28/18 6:42 PM, Ted Yu wrote:
>>>>>>>>>> Looks good to me.
>>>>>>>>>>
>>>>>>>>>> On Wed, Mar 28, 2018 at 3:11 PM, John Roesler <
>>>> john@confluent.io
>>>>>>
>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hello all,
>>>>>>>>>>>
>>>>>>>>>>> I am proposing KIP-274 to improve the metrics around
>>> skipped
>>>>>> records
>>>>>>>> in
>>>>>>>>>>> Streams.
>>>>>>>>>>>
>>>>>>>>>>> Please find the details here:
>>>>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-
>>>>>>>>>>> 274%3A+Kafka+Streams+Skipped+Records+Metrics
>>>>>>>>>>>
>>>>>>>>>>> Please let me know what you think!
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> -John
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> -- Guozhang
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> -- Guozhang
>>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> -- Guozhang
>>>>
>>>
>>
>>
> 


Re: [DISCUSS] KIP-274: Kafka Streams Skipped Records Metrics

Posted by "Matthias J. Sax" <ma...@confluent.io>.
I am fine with adding `TopologyTestDriver#metrics()`. It should be
helpful for testing custom metrics that users can implement.

-Matthias


On 4/3/18 11:35 AM, John Roesler wrote:
> Oh, sorry, I missed the point.
> 
> Yeah, we can totally do that. The reason to move it to the task level was
> mainly to make it available for the metrics in TopologyTestDriver as well.
> But if we decide that's a non-goal, then there's no motivation to change it.
> 
> And actually that reminds me that we do have an open question about whether
> I should add a metrics getter to the TopologyTestDriver's interface. WDYT?
> 
> Thanks,
> -John
> 
> On Tue, Apr 3, 2018 at 1:26 PM, Guozhang Wang <wa...@gmail.com> wrote:
> 
>> I think Matthias' comment is that, we can still record the metrics on the
>> thread-level, while having the WARN log entry to include sufficient context
>> information so that users can still easily narrow down the investigation
>> scope.
>>
>>
>> Guozhang
>>
>> On Tue, Apr 3, 2018 at 11:22 AM, John Roesler <jo...@confluent.io> wrote:
>>
>>> I agree we should add as much information as is reasonable to the log.
>> For
>>> example, see this WIP PR I started for this KIP:
>>>
>>> https://github.com/apache/kafka/pull/4812/files#diff-
>>> 88d129f048bc842c7db5b2566a45fce8R80
>>>
>>> and
>>>
>>> https://github.com/apache/kafka/pull/4812/files#diff-
>>> 69e6789eb675ec978a1abd24fed96eb1R111
>>>
>>> I'm not sure if we should nail down the log messages in the KIP or in the
>>> PR discussion. What say you?
>>>
>>> Thanks,
>>> -John
>>>
>>> On Tue, Apr 3, 2018 at 12:20 AM, Matthias J. Sax <ma...@confluent.io>
>>> wrote:
>>>
>>>> Thanks for sharing your thoughts. As I mentioned originally, I am not
>>>> sure about the right log level either. Your arguments are convincing --
>>>> thus, I am fine with keeping WARN level.
>>>>
>>>> The task vs thread level argument is an interesting one. However, I am
>>>> wondering if we should add this information into the corresponding WARN
>>>> logs that we write anyway? For this case, we can also log the
>>>> corresponding operator (and other information like topic name etc if
>>>> needed). WDYT about this?
>>>>
>>>>
>>>> -Matthias
>>>>
>>>> On 4/2/18 8:31 PM, Guozhang Wang wrote:
>>>>> Regarding logging: I'm inclined to keep logging at WARN level since
>>>> skipped
>>>>> records are not expected in normal execution (for all reasons that we
>>> are
>>>>> aware of), and hence when error happens users should be alerted from
>>>>> metrics and looked into the log files, so to me if it is really
>>> spamming
>>>>> the log files it is also a good alert for users. Besides for
>>> deserialize
>>>>> errors we already log at WARN level for this reason.
>>>>>
>>>>> Regarding the metrics-levels: I was pondering on that as well. What
>>> made
>>>> me
>>>>> to think and agree on task-level than thread-level is that for some
>>>> reasons
>>>>> like window retention, they may possibly be happening on a subset of
>>>> input
>>>>> partitions, and tasks are correlated with partitions the task-level
>>>> metrics
>>>>> can help users to narrow down on the specific input data partitions.
>>>>>
>>>>>
>>>>> Guozhang
>>>>>
>>>>>
>>>>> On Mon, Apr 2, 2018 at 6:43 PM, John Roesler <jo...@confluent.io>
>>> wrote:
>>>>>
>>>>>> Hi Matthias,
>>>>>>
>>>>>> No worries! Thanks for the reply.
>>>>>>
>>>>>> 1) There isn't a connection. I tried using the TopologyTestDriver to
>>>> write
>>>>>> a quick test exercising the current behavior and discovered that the
>>>>>> metrics weren't available. It seemed like they should be, so I
>> tacked
>>>> it on
>>>>>> to this KIP. If you feel it's inappropriate, I can pull it back out.
>>>>>>
>>>>>> 2) I was also concerned about that, but I figured it would come up
>> in
>>>>>> discussion if I just went ahead and proposed it. And here we are!
>>>>>>
>>>>>> Here's my thought: maybe there are two classes of skips:
>> "controlled"
>>>> and
>>>>>> "uncontrolled", where "controlled" means, as an app author, I
>>>> deliberately
>>>>>> filter out some events, and "uncontrolled" means that I simply don't
>>>>>> account for some feature of the data, and the framework skips them
>> (as
>>>>>> opposed to crashing).
>>>>>>
>>>>>> In this breakdowns, the skips I'm adding metrics for are all
>>>> uncontrolled
>>>>>> skips (and we hope to measure all the uncontrolled skips). Our skips
>>> are
>>>>>> well documented, so it wouldn't be terrible to have an application
>> in
>>>> which
>>>>>> you know you expect to have tons of uncontrolled skips, but it's not
>>>> great
>>>>>> either, since you may also have some *unexpected* uncontrolled
>> skips.
>>>> It'll
>>>>>> be difficult to notice, since you're probably not alerting on the
>>> metric
>>>>>> and filtering out the logs (whatever their level).
>>>>>>
>>>>>> I'd recommend any app author, as an alternative, to convert all
>>> expected
>>>>>> skips to controlled ones, by updating the topology to filter those
>>>> records
>>>>>> out.
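[Editor's sketch: the "controlled vs. uncontrolled" distinction above, made concrete in plain Java. The framework behavior is simulated; none of this is Kafka Streams API:]

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;
import java.util.stream.Collectors;

public class SkipClasses {
    record Rec(String key, long timestamp) {}

    static final AtomicLong UNCONTROLLED_SKIPS = new AtomicLong();

    // Simulated framework step: silently skips records it cannot handle
    // (e.g. a negative timestamp) and bumps the skip metric.
    static List<Rec> frameworkStep(List<Rec> in) {
        return in.stream()
                 .filter(r -> {
                     if (r.timestamp() < 0) { UNCONTROLLED_SKIPS.incrementAndGet(); return false; }
                     return true;
                 })
                 .collect(Collectors.toList());
    }

    // "Controlled" variant: the application filters bad records itself, so
    // the framework's skip metric stays at zero and any non-zero value
    // becomes a genuinely alarming signal worth alerting on.
    static List<Rec> controlledFilter(List<Rec> in) {
        return in.stream().filter(r -> r.timestamp() >= 0).collect(Collectors.toList());
    }
}
```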
>>>>>>
>>>>>> Following from my recommendation, as a library author, I'm inclined
>> to
>>>> mark
>>>>>> those logs WARN, since in my opinion, they should be concerning to
>> the
>>>> app
>>>>>> authors. I'd definitely want to show, rather than hide, them by
>>>> default, so
>>>>>> I would pick INFO at least.
>>>>>>
>>>>>> That said, logging is always a tricky issue for lower-level
>> libraries
>>>> that
>>>>>> run inside user code, since we don't have all the information we
>> need
>>> to
>>>>>> make the right call.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On your last note, yeah, I got that impression from Guozhang as
>> well.
>>>>>> Thanks for the clarification.
>>>>>>
>>>>>> -John
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Mon, Apr 2, 2018 at 4:03 PM, Matthias J. Sax <
>>> matthias@confluent.io>
>>>>>> wrote:
>>>>>>
>>>>>>> John,
>>>>>>>
>>>>>>> sorry for my late reply and thanks for updating the KIP.
>>>>>>>
>>>>>>> I like your approach about "metrics are for monitoring, logs are
>> for
>>>>>>> debugging" -- however:
>>>>>>>
>>>>>>> 1) I don't see a connection between this and the task-level metrics
>>>> that
>>>>>>> you propose to get the metrics in `TopologyTestDriver`. I don't
>> think
>>>>>>> people would monitor the `TopologyTestDriver` an thus wondering why
>>> it
>>>>>>> is important to include the metrics there? Thread-level metric
>> might
>>> be
>>>>>>> easier to monitor though (ie, less different metric to monitor).
>>>>>>>
>>>>>>> 2) I am a little worried about WARN level logging and that it might
>>> be
>>>>>>> too chatty -- as you pointed out, it's about debugging, thus DEBUG
>>>> level
>>>>>>> might be better. Not 100% sure about this to be honest. What is the
>>>>>>> general assumption about the frequency for skipped records? I could
>>>>>>> imagine cases for which skipped records are quite frequent and
>> thus,
>>>>>>> WARN level logs might "flood" the logs
>>>>>>>
>>>>>>> One final remark:
>>>>>>>
>>>>>>>> More
>>>>>>>> generally, I would like to establish a pattern in which we could
>> add
>>>>>> new
>>>>>>>> values for the "reason" tags without needing a KIP to do so.
>>>>>>>
>>>>>>> From my understanding, this is not feasible. Changing metrics is
>>> always
>>>>>>> considered a public API change, and we need a KIP for any change.
>> As
>>> we
>>>>>>> moved away from tagging, it doesn't matter for the KIP anymore --
>>> just
>>>>>>> wanted to point it out.
>>>>>>>
>>>>>>>
>>>>>>> -Matthias
>>>>>>>
>>>>>>>
>>>>>>> On 3/30/18 2:47 PM, John Roesler wrote:
>>>>>>>> Allrighty! The KIP is updated.
>>>>>>>>
>>>>>>>> Thanks again, all, for the feedback.
>>>>>>>> -John
>>>>>>>>
>>>>>>>> On Fri, Mar 30, 2018 at 3:35 PM, John Roesler <jo...@confluent.io>
>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hey Guozhang and Bill,
>>>>>>>>>
>>>>>>>>> Ok, I'll update the KIP. At the risk of disturbing consensus, I'd
>>>> like
>>>>>>> to
>>>>>>>>> put it in the task instead of the thread so that it'll show up in
>>> the
>>>>>>>>> TopologyTestDriver metrics as well.
>>>>>>>>>
>>>>>>>>> I'm leaning toward keeping the scope where it is right now, but
>> if
>>>>>>> others
>>>>>>>>> want to advocate for tossing in some more metrics, we can go that
>>>>>> route.
>>>>>>>>>
>>>>>>>>> Thanks all,
>>>>>>>>> -John
>>>>>>>>>
>>>>>>>>> On Fri, Mar 30, 2018 at 2:37 PM, Bill Bejeck <bb...@gmail.com>
>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Thanks for the KIP John, and sorry for the late comments.
>>>>>>>>>>
>>>>>>>>>> I'm on the fence with providing a single level metrics, but I
>>> think
>>>>>>> we'll
>>>>>>>>>> have that discussion outside of this KIP.
>>>>>>>>>>
>>>>>>>>>>> * maintain one skipped-record metric (could be per-thread,
>>>> per-task,
>>>>>>> or
>>>>>>>>>>> per-processor-node) with no "reason"
>>>>>>>>>>> * introduce a warn-level log detailing the
>> topic/partition/offset
>>>>>> and
>>>>>>>>>>> reason of the skipped record
>>>>>>>>>>
>>>>>>>>>> I'm +1 on both of these suggestions.
>>>>>>>>>>
>>>>>>>>>> Finally, we have had requests in the past for some metrics
>> around
>>>>>> when
>>>>>>>>>> persistent store removes an expired window.  Would adding that
>> to
>>>> our
>>>>>>>>>> metrics stretch the scope of this KIP too much?
>>>>>>>>>>
>>>>>>>>>> Thanks again and overall I'm +1 on this KIP
>>>>>>>>>>
>>>>>>>>>> Bill
>>>>>>>>>>
>>>>>>>>>> On Fri, Mar 30, 2018 at 2:00 PM, Guozhang Wang <
>>> wangguoz@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> The proposal sounds good to me. About "maintain only one level
>> of
>>>>>>>>>> metrics"
>>>>>>>>>>> maybe we can discuss about that separately from this KIP since
>>> that
>>>>>>>>>> would
>>>>>>>>>>> be a larger scope of discussion. I agree that if we are going
>> to
>>>>>>>>>> maintain
>>>>>>>>>>> only one-level metrics it should be lowest level and we would
>> let
>>>>>>> users
>>>>>>>>>> to
>>>>>>>>>>> do the roll-ups themselves, but I'm still not fully convinced
>>> that
>>>>>> we
>>>>>>>>>>> should just provide single-level metrics, because 1) I think
>> for
>>>>>>>>>> different
>>>>>>>>>>> metrics people may be interested to investigate into different
>>>>>>>>>>> granularities, e.g. for poll / commit rate these are at the
>>> lowest
>>>>>>>>>>> task-level metrics, while for process-rate / skip-rate they can
>>> be
>>>>>> as
>>>>>>>>>> low
>>>>>>>>>>> as processor-node metrics, and 2) user-side rolling ups may not
>>> be
>>>>>>> very
>>>>>>>>>>> straight-forward. But for 2) if someone can provide an
>> efficient
>>>> and
>>>>>>>>>> easy
>>>>>>>>>>> implementation of that I can be persuaded :)
>>>>>>>>>>>
>>>>>>>>>>> For now I'm thinking we can add the metric on thread-level,
>>> either
>>>>>>> with
>>>>>>>>>>> finer grained ones with "reason" tag plus an aggregated one
>>> without
>>>>>>> the
>>>>>>>>>>> tag, or just having a single aggregated metric without the tag
>>>> looks
>>>>>>>>>> good
>>>>>>>>>>> to me.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Guozhang
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Mar 30, 2018 at 8:05 AM, John Roesler <
>> john@confluent.io
>>>>
>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hey Guozhang,
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks for the reply. Regarding JMX, I can dig it. I'll
>> provide
>>> a
>>>>>>>>>> list in
>>>>>>>>>>>> the KIP. I was also thinking we'd better start a documentation
>>>> page
>>>>>>>>>> with
>>>>>>>>>>>> the metrics listed.
>>>>>>>>>>>>
>>>>>>>>>>>> I'd have no problem logging a warning when we skip records. On
>>> the
>>>>>>>>>> metric
>>>>>>>>>>>> front, really I'm just pushing for us to maintain only one
>> level
>>>> of
>>>>>>>>>>>> metrics. If that's more or less granular (i.e., maybe we don't
>>>>>> have a
>>>>>>>>>>>> metric per reason and log the reason instead), that's fine by
>>> me.
>>>> I
>>>>>>>>>> just
>>>>>>>>>>>> don't think it provides a lot of extra value per complexity
>>>>>>> (interface
>>>>>>>>>>> and
>>>>>>>>>>>> implementation) to maintain roll-ups at the thread level in
>>>>>> addition
>>>>>>>>>> to
>>>>>>>>>>>> lower-level metrics.
>>>>>>>>>>>>
>>>>>>>>>>>> How about this instead:
>>>>>>>>>>>> * maintain one skipped-record metric (could be per-thread,
>>>>>> per-task,
>>>>>>>>>> or
>>>>>>>>>>>> per-processor-node) with no "reason"
>>>>>>>>>>>> * introduce a warn-level log detailing the
>>> topic/partition/offset
>>>>>> and
>>>>>>>>>>>> reason of the skipped record
>>>>>>>>>>>>
>>>>>>>>>>>> If you like that, I can update the KIP.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> -John
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Mar 29, 2018 at 6:22 PM, Guozhang Wang <
>>>> wangguoz@gmail.com
>>>>>>>
>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>>> One thing you mention is the notion of setting alerts on
>>> coarser
>>>>>>>>>>>> metrics
>>>>>>>>>>>>> being easier than finer ones. All the metric alerting
>> systems I
>>>>>> have
>>>>>>>>>>> used
>>>>>>>>>>>>> make it equally easy to alert on metrics by-tag or over tags.
>>> So
>>>>>> my
>>>>>>>>>>>>> experience doesn't say that this is a use case. Were you
>>> thinking
>>>>>>>>>> of an
>>>>>>>>>>>>> alerting system that makes such a pre-aggregation valuable?
>>>>>>>>>>>>>
>>>>>>>>>>>>> For the commonly used JMX reporter tags will be encoded
>>> directly
>>>>>> as
>>>>>>>>>>> part
>>>>>>>>>>>> of
>>>>>>>>>>>>> the object name, and if users wants to monitor them they need
>>> to
>>>>>>>>>> know
>>>>>>>>>>>> these
>>>>>>>>>>>>> values before hand. That is also why I think we do want to
>> list
>>>>>> all
>>>>>>>>>> the
>>>>>>>>>>>>> possible values of the reason tags in the KIP, since
>>>>>>>>>>>>>
>>>>>>>>>>>>>> In my email in response to Matthias, I gave an example of
>> the
>>>>>>>>>> kind of
>>>>>>>>>>>>> scenario that would lead me as an operator to run with DEBUG
>> on
>>>>>> all
>>>>>>>>>> the
>>>>>>>>>>>>> time, since I wouldn't be sure, having seen a skipped record
>>>> once,
>>>>>>>>>> that
>>>>>>>>>>>> it
>>>>>>>>>>>>> would ever happen again. The solution is to capture all the
>>>>>>>>>> available
>>>>>>>>>>>>> information about the reason and location of skips all the
>>> time.
>>>>>>>>>>>>>
>>>>>>>>>>>>> That is a good point. I think we can either expose all levels
>>>>>>>>>> metrics
>>>>>>>>>>> as
>>>>>>>>>>>> by
>>>>>>>>>>>>> default, or only expose the most lower-level metrics and get
>>> rid
>>>>>> of
>>>>>>>>>>> other
>>>>>>>>>>>>> levels to let users do roll-ups themselves (which will be a
>>> much
>>>>>>>>>> larger
>>>>>>>>>>>>> scope for discussion), or we can encourage users to not
>> purely
>>>>>>>>>> depend
>>>>>>>>>>> on
>>>>>>>>>>>>> metrics for such trouble shooting: that is to say, users only
>>> be
>>>>>>>>>>> alerted
>>>>>>>>>>>>> based on metrics, and we can log a info / warn log4j entry
>> each
>>>>>>>>>> time we
>>>>>>>>>>>> are
>>>>>>>>>>>>> about to skip a record all over the places, so that upon
>> being
>>>>>>>>>> notified
>>>>>>>>>>>>> users can look into the logs to find the details on where /
>>> when
>>>>>> it
>>>>>>>>>>>>> happens. WDYT?
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Guozhang
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, Mar 29, 2018 at 3:57 PM, John Roesler <
>>> john@confluent.io
>>>>>
>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hey Guozhang,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks for the review.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 1.
>>>>>>>>>>>>>> Matthias raised the same question about the "reason" tag
>>> values.
>>>>>> I
>>>>>>>>>>> can
>>>>>>>>>>>>> list
>>>>>>>>>>>>>> all possible values of the "reason" tag, but I'm thinking
>> this
>>>>>>>>>> level
>>>>>>>>>>> of
>>>>>>>>>>>>>> detail may not be KIP-worthy, maybe the code and
>> documentation
>>>>>>>>>> review
>>>>>>>>>>>>> would
>>>>>>>>>>>>>> be sufficient. If you all disagree and would like it
>> included
>>> in
>>>>>>>>>> the
>>>>>>>>>>>>> KIP, I
>>>>>>>>>>>>>> can certainly do that.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> If we do provide roll-up metrics, I agree with the pattern
>> of
>>>>>>>>>> keeping
>>>>>>>>>>>> the
>>>>>>>>>>>>>> same name but eliminating the tags for the dimensions that
>>> were
>>>>>>>>>>>>> rolled-up.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 2.
>>>>>>>>>>>>>> I'm not too sure that implementation efficiency really
>>> becomes a
>>>>>>>>>>> factor
>>>>>>>>>>>>> in
>>>>>>>>>>>>>> choosing whether to (by default) update one coarse metric at
>>> the
>>>>>>>>>>> thread
>>>>>>>>>>>>>> level or one granular metric at the processor-node level,
>>> since
>>>>>>>>>> it's
>>>>>>>>>>>> just
>>>>>>>>>>>>>> one metric being updated either way. I do agree that if we
>>> were
>>>>>> to
>>>>>>>>>>>> update
>>>>>>>>>>>>>> the granular metrics and multiple roll-ups, then we should
>>>>>>>>>> consider
>>>>>>>>>>> the
>>>>>>>>>>>>>> efficiency.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I agree it's probably not necessary to surface the metrics
>> for
>>>>>> all
>>>>>>>>>>>> nodes
>>>>>>>>>>>>>> regardless of whether they can or do skip records. Perhaps
>> we
>>>> can
>>>>>>>>>>>> lazily
>>>>>>>>>>>>>> register the metrics.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> In my email in response to Matthias, I gave an example of
>> the
>>>>>>>>>> kind of
>>>>>>>>>>>>>> scenario that would lead me as an operator to run with DEBUG
>>> on
>>>>>>>>>> all
>>>>>>>>>>> the
>>>>>>>>>>>>>> time, since I wouldn't be sure, having seen a skipped record
>>>>>> once,
>>>>>>>>>>> that
>>>>>>>>>>>>> it
>>>>>>>>>>>>>> would ever happen again. The solution is to capture all the
>>>>>>>>>> available
>>>>>>>>>>>>>> information about the reason and location of skips all the
>>> time.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> One thing you mention is the notion of setting alerts on
>>> coarser
>>>>>>>>>>>> metrics
>>>>>>>>>>>>>> being easier than finer ones. All the metric alerting
>> systems
>>> I
>>>>>>>>>> have
>>>>>>>>>>>> used
>>>>>>>>>>>>>> make it equally easy to alert on metrics by-tag or over
>> tags.
>>> So
>>>>>>>>>> my
>>>>>>>>>>>>>> experience doesn't say that this is a use case. Were you
>>>> thinking
>>>>>>>>>> of
>>>>>>>>>>> an
>>>>>>>>>>>>>> alerting system that makes such a pre-aggregation valuable?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks again,
>>>>>>>>>>>>>> -John
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, Mar 29, 2018 at 5:24 PM, Guozhang Wang <
>>>>>>>>>> wangguoz@gmail.com>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hello John,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks for the KIP. Some comments:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 1. Could you list all the possible values of the "reason"
>>> tag?
>>>>>>>>>> In
>>>>>>>>>>> the
>>>>>>>>>>>>>> JIRA
>>>>>>>>>>>>>>> ticket I left some potential reasons but I'm not clear if
>>>> you're
>>>>>>>>>>>> going
>>>>>>>>>>>>> to
>>>>>>>>>>>>>>> categorize each of them as a separate reason, or is there
>> any
>>>>>>>>>>>>> additional
>>>>>>>>>>>>>>> ones you have in mind.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Also I'm wondering if we should add another metric that do
>>> not
>>>>>>>>>> have
>>>>>>>>>>>> the
>>>>>>>>>>>>>>> reason tag but aggregates among all possible reasons? This
>> is
>>>>>>>>>> for
>>>>>>>>>>>> users
>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>> easily set their alerting notifications (otherwise they have
>>>>>>>>>>>>>>> to write one notification rule per reason) in their monitoring
>>>>>>>>>>>>>>> systems.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 2. Note that the processor-node metrics is actually
>>>> "per-thread,
>>>>>>>>>>>>>> per-task,
>>>>>>>>>>>>>>> per-processor-node", and today we only set the per-thread
>>>>>>>>>> metrics
>>>>>>>>>>> as
>>>>>>>>>>>>> INFO
>>>>>>>>>>>>>>> while leaving the lower two layers as DEBUG. I agree with
>>> your
>>>>>>>>>>>> argument
>>>>>>>>>>>>>>> that we are missing the per-client roll-up metrics today,
>> but
>>>>>>>>>> I'm
>>>>>>>>>>>>>> convinced
>>>>>>>>>>>>>>> that the right way to approach it would be
>>>>>>>>>>>> "just-providing-the-lowest-
>>>>>>>>>>>>>>> level
>>>>>>>>>>>>>>> metrics only".
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Note the recoding implementation of these three levels are
>>>>>>>>>>> different
>>>>>>>>>>>>>>> internally today: we did not just do the rolling up to
>>> generate
>>>>>>>>>> the
>>>>>>>>>>>>>>> higher-level metrics from the lower level ones, but we just
>>>>>>>>>> record
>>>>>>>>>>>> them
>>>>>>>>>>>>>>> separately, which means that, if we turn on multiple levels
>>> of
>>>>>>>>>>>> metrics,
>>>>>>>>>>>>>> we
>>>>>>>>>>>>>>> maybe duplicate collecting some metrics. One can argue that
>>> is
>>>>>>>>>> not
>>>>>>>>>>>> the
>>>>>>>>>>>>>> best
>>>>>>>>>>>>>>> way to represent multi-level metrics collecting and
>>> reporting,
>>>>>>>>>> but
>>>>>>>>>>> by
>>>>>>>>>>>>>> only
>>>>>>>>>>>>>>> enabling thread-level metrics as INFO today, that
>>>> implementation
>>>>>>>>>>>> could
>>>>>>>>>>>>> be
>>>>>>>>>>>>>>> more efficient than only collecting the metrics at the
>> lowest
>>>>>>>>>>> level,
>>>>>>>>>>>>> and
>>>>>>>>>>>>>>> then do the roll-up calculations outside of the metrics
>>>> classes.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Plus, today not all processor-nodes may possibly skip
>>> records,
>>>>>>>>>>> AFAIK
>>>>>>>>>>>> we
>>>>>>>>>>>>>>> will only skip records at the source, sink, window and
>>>>>>>>>> aggregation
>>>>>>>>>>>>>>> processor nodes, so adding a metric per processor looks
>> like
>>> an
>>>>>>>>>>>>> overkill
>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>> me as well. On the other hand, from user's perspective the
>>>>>>>>>> "reason"
>>>>>>>>>>>> tag
>>>>>>>>>>>>>> may
>>>>>>>>>>>>>>> be sufficient for them to narrow down where inside the
>>> topology
>>>>>>>>>> is
>>>>>>>>>>>>>> causing
>>>>>>>>>>>>>>> records to be dropped on the floor. So I think the
>>>>>>>>>>>>>>> "per-thread, per-task" level metrics should be sufficient for
>>>>>>>>>>>>>>> them to troubleshoot in DEBUG mode,
>>>>>>>>>>>>>>> and we can add another "per-thread" level metrics as INFO
>>> which
>>>>>>>>>> is
>>>>>>>>>>>>> turned
>>>>>>>>>>>>>>> on by default. So under normal execution users still only
>>> need
>>>>>>>>>> INFO
>>>>>>>>>>>>> level
>>>>>>>>>>>>>>> metrics for alerting (e.g. set alerts on all
>> skipped-records
>>>>>>>>>>> metrics
>>>>>>>>>>>> as
>>>>>>>>>>>>>>> non-zero), and then upon trouble shooting they can turn on
>>>> DEBUG
>>>>>>>>>>>>> metrics
>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>> look into which task is actually causing the skipped
>> records.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Guozhang
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Thu, Mar 29, 2018 at 2:03 PM, Matthias J. Sax <
>>>>>>>>>>>>> matthias@confluent.io>
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks for the KIP John.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Reading the material on the related Jiras, I am wondering
>>> what
>>>>>>>>>>>>> `reason`
>>>>>>>>>>>>>>>> tags you want to introduce? Can you elaborate? The KIP
>>> should
>>>>>>>>>>> list
>>>>>>>>>>>>>> those
>>>>>>>>>>>>>>>> IMHO.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> About the fine grained metrics vs the roll-up: you say
>> that
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> the coarse metric aggregates across two dimensions
>>>>>>>>>>> simultaneously
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Can you elaborate why this is an issue? I am not convinced
>>> atm
>>>>>>>>>>> that
>>>>>>>>>>>>> we
>>>>>>>>>>>>>>>> should put the fine grained metrics into INFO level and
>>> remove
>>>>>>>>>>> the
>>>>>>>>>>>>>>>> roll-up at thread level.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Given that they have to do this sum to get a usable
>>>>>>>>>> top-level
>>>>>>>>>>>> view
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> This is a fair concern, but I don't share the conclusion.
>>>>>>>>>>> Offering
>>>>>>>>>>>> a
>>>>>>>>>>>>>>>> built-in `KafkaStreams` "client" roll-up out of the box
>>> might
>>>>>>>>>> be
>>>>>>>>>>> a
>>>>>>>>>>>>>>>> better solution. In the past we did not offer this due to
>>>>>>>>>>>> performance
>>>>>>>>>>>>>>>> concerns, but we could allow an "opt-in" mechanism. If you
>>>>>>>>>>>> disagree,
>>>>>>>>>>>>>> can
>>>>>>>>>>>>>>>> you provide some reasoning and add them to the "Rejected
>>>>>>>>>>>>> alternatives"
>>>>>>>>>>>>>>>> section.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> To rephrase: I understand the issue about missing
>> top-level
>>>>>>>>>> view,
>>>>>>>>>>>> but
>>>>>>>>>>>>>>>> instead of going more fine grained, we should consider to
>>> add
>>>>>>>>>>> this
>>>>>>>>>>>>>>>> top-level view and add/keep the fine grained metrics at
>>> DEBUG
>>>>>>>>>>> level
>>>>>>>>>>>>>> only
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I am +1 to add TopologyTestDriver#metrics() and to remove
>>> old
>>>>>>>>>>>> metrics
>>>>>>>>>>>>>>>> directly as you suggested.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> -Matthias
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On 3/28/18 6:42 PM, Ted Yu wrote:
>>>>>>>>>>>>>>>>> Looks good to me.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Wed, Mar 28, 2018 at 3:11 PM, John Roesler <
>>>>>>>>>>> john@confluent.io
>>>>>>>>>>>>>
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hello all,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I am proposing KIP-274 to improve the metrics around
>>>>>>>>>> skipped
>>>>>>>>>>>>> records
>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>> Streams.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Please find the details here:
>>>>>>>>>>>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-
>>>>>>>>>>>>>>>>>> 274%3A+Kafka+Streams+Skipped+Records+Metrics
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Please let me know what you think!
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>> -John
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>> -- Guozhang
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> -- Guozhang
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> -- Guozhang
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>>
>>
>> --
>> -- Guozhang
>>
> 


Re: [DISCUSS] KIP-274: Kafka Streams Skipped Records Metrics

Posted by "Matthias J. Sax" <ma...@confluent.io>.
Thanks for updating the KIP. LGTM.

-Matthias


On 4/4/18 12:46 PM, John Roesler wrote:
> Cool, if you're satisfied with the KIP now, maybe I can lobby for your vote
> ;)
> 
> The vote thread is still at only one binding +1, I think.
> 
> Thanks,
> -John
> 
> On Tue, Apr 3, 2018 at 9:05 PM, Matthias J. Sax <ma...@confluent.io>
> wrote:
> 
>> Sounds great!
>>
>> The cryptic topic names can be an issue -- however, people can
>> `describe()`  their topology to map the name to the corresponding
>> sub-topology/tasks to narrow the error down to the corresponding
>> operators. I think, this should be "sufficient for now" for debugging.
>>
>> Renaming those topic seems to be out-of-scope for this KIP.
>>
>>
>> -Matthias
>>
>> On 4/3/18 2:45 PM, Guozhang Wang wrote:
>>> Thanks John, your proposal looks fine to me.
>>>
>>> I'll go ahead and look into the PR for more details myself.
>>>
>>>
>>> Guozhang
>>>
>>> On Tue, Apr 3, 2018 at 1:35 PM, Bill Bejeck <bb...@gmail.com> wrote:
>>>
>>>> Hi John,
>>>>
>>>> Thanks for making the updates.
>>>>
>>>> I agree with the information you've included in the logs as described
>>>> above, as log statements without enough context/information can be
>>>> frustrating.
>>>>
>>>> -Bill
>>>>
>>>> On Tue, Apr 3, 2018 at 3:29 PM, John Roesler <jo...@confluent.io> wrote:
>>>>
>>>>> Allrighty, how about this, then...
>>>>>
>>>>> I'll move the metric back to the StreamThread and maintain the existing
>>>> tag
>>>>> (client-id=...(per-thread client-id)). It won't be present in the
>>>>> TopologyTestDriver's metrics.
>>>>>
>>>>> As a side note, I'm not sure that the location of the log messages has
>>>>> visibility into the name of the thread or the task, or the processor
>>>> node,
>>>>> for that matter. But at the end of the day, I don't think it really
>>>>> matters.
>>>>>
>>>>> None of those identifiers are in the public interface or
>> user-controlled.
>>>>> For them to be useful for debugging, users would have to gain a very
>> deep
>>>>> understanding of how their DSL program gets executed. From my
>>>> perspective,
>>>>> they are all included in metric tags only to prevent collisions between
>>>> the
>>>>> same metrics in different (e.g.) threads.
>>>>>
>>>>> I think what's important is to provide the right information in the
>> logs
>>>>> that users will be able to debug their issues. This is why the logs in
>> my
>>>>> pr include the topic/partition/offset of the offending data, as well as
>>>> the
>>>>> stacktrace of the exception from the deserializer (or for timestamps,
>> the
>>>>> extracted timestamp and the class name of their extractor). This
>>>>> information alone should let them pinpoint the offending data and fix
>> it.
>>>>>
>>>>> (I am aware that that topic name might be a repartition topic, and
>>>>> therefore also esoteric from the user's perspective, but I think it's
>> the
>>>>> best we can do right now. It might be nice to explicitly take on a
>>>>> debugging ergonomics task in the future and give all processor nodes
>>>>> human-friendly names. Then, we could surface these names in any logs or
>>>>> exceptions. But I'm inclined to call this out-of-scope for now.)
>>>>>
>>>>> Thanks again,
>>>>> -John
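[Editor's note: to make the log content John describes concrete, here is a small standalone sketch in plain Java. It is not the actual Streams implementation; the method names and message layout are illustrative only, showing the kind of topic/partition/offset and extractor context a skip warning could carry.]

```java
public class SkipLogSketch {

    // Formats a warning for a record skipped due to a deserialization error,
    // carrying the topic/partition/offset context discussed in the thread.
    static String deserializationSkip(String topic, int partition, long offset, Exception cause) {
        return String.format(
            "WARN Skipping record due to deserialization error. "
                + "topic=[%s] partition=[%d] offset=[%d] cause=[%s]",
            topic, partition, offset, cause);
    }

    // Formats a warning for a record skipped due to an invalid extracted
    // timestamp, including the extractor class name so users can pinpoint it.
    static String timestampSkip(String topic, int partition, long offset,
                                long extractedTimestamp, String extractorClass) {
        return String.format(
            "WARN Skipping record due to invalid timestamp. "
                + "topic=[%s] partition=[%d] offset=[%d] timestamp=[%d] extractor=[%s]",
            topic, partition, offset, extractedTimestamp, extractorClass);
    }

    public static void main(String[] args) {
        System.out.println(deserializationSkip("input-topic", 2, 517L,
            new IllegalArgumentException("bad payload")));
        System.out.println(timestampSkip("input-topic", 2, 518L, -1L,
            "org.example.MyTimestampExtractor"));
    }
}
```

With this much context in the message itself, a user can locate the offending record without ever mapping internal thread or task names back to their topology.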
>>>>>
>>>>> On Tue, Apr 3, 2018 at 1:40 PM, Guozhang Wang <wa...@gmail.com>
>>>> wrote:
>>>>>
>>>>>> 1. If we can indeed gather all the context information from the log4j
>>>>>> entries I'd suggest we change to thread-level (I'm not sure if that is
>>>>>> doable, so if John already has a WIP PR, that can help us decide).
>>>>>>
>>>>>> 2. We can consider adding the API in TopologyTestDriver for general
>>>>> testing
>>>>>> purposes; that being said, I think Matthias has a good point that this
>>>>>> alone should not be a driving motivation for us to keep this metric as
>>>>>> task-level if 1) is true.
>>>>>>
>>>>>>
>>>>>>
>>>>>> Guozhang
>>>>>>
>>>>>>
>>>>>> On Tue, Apr 3, 2018 at 11:36 AM, Matthias J. Sax <
>>>> matthias@confluent.io>
>>>>>> wrote:
>>>>>>
>>>>>>> Thanks Guozhang, that was my intent.
>>>>>>>
>>>>>>> @John: yes, we should not nail down the exact log message. It's just
>>>> to
>>>>>>> point out the trade-off. If we can get the required information in
>>>> the
>>>>>>> logs, we might not need task level metrics.
>>>>>>>
>>>>>>>
>>>>>>> -Matthias
>>>>>>>
>>>>>>> On 4/3/18 11:26 AM, Guozhang Wang wrote:
>>>>>>>> I think Matthias' comment is that, we can still record the metrics
>>>> on
>>>>>> the
>>>>>>>> thread-level, while having the WARN log entry to include sufficient
>>>>>>> context
>>>>>>>> information so that users can still easily narrow down the
>>>>>> investigation
>>>>>>>> scope.
>>>>>>>>
>>>>>>>>
>>>>>>>> Guozhang
>>>>>>>>
>>>>>>>> On Tue, Apr 3, 2018 at 11:22 AM, John Roesler <jo...@confluent.io>
>>>>>> wrote:
>>>>>>>>
>>>>>>>>> I agree we should add as much information as is reasonable to the
>>>>> log.
>>>>>>> For
>>>>>>>>> example, see this WIP PR I started for this KIP:
>>>>>>>>>
>>>>>>>>> https://github.com/apache/kafka/pull/4812/files#diff-
>>>>>>>>> 88d129f048bc842c7db5b2566a45fce8R80
>>>>>>>>>
>>>>>>>>> and
>>>>>>>>>
>>>>>>>>> https://github.com/apache/kafka/pull/4812/files#diff-
>>>>>>>>> 69e6789eb675ec978a1abd24fed96eb1R111
>>>>>>>>>
>>>>>>>>> I'm not sure if we should nail down the log messages in the KIP or
>>>>> in
>>>>>>> the
>>>>>>>>> PR discussion. What say you?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> -John
>>>>>>>>>
>>>>>>>>> On Tue, Apr 3, 2018 at 12:20 AM, Matthias J. Sax <
>>>>>> matthias@confluent.io
>>>>>>>>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Thanks for sharing your thoughts. As I mentioned originally, I am
>>>>> not
>>>>>>>>>> sure about the right log level either. Your arguments are
>>>>> convincing
>>>>>> --
>>>>>>>>>> thus, I am fine with keeping WARN level.
>>>>>>>>>>
>>>>>>>>>> The task vs thread level argument is an interesting one.
>>>> However, I
>>>>>> am
>>>>>>>>>> wondering if we should add this information into the
>>>> corresponding
>>>>>> WARN
>>>>>>>>>> logs that we write anyway? For this case, we can also log the
>>>>>>>>>> corresponding operator (and other information like topic name etc
>>>>> if
>>>>>>>>>> needed). WDYT about this?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> -Matthias
>>>>>>>>>>
>>>>>>>>>> On 4/2/18 8:31 PM, Guozhang Wang wrote:
>>>>>>>>>>> Regarding logging: I'm inclined to keep logging at WARN level
>>>>>>>>>>> since skipped records are not expected in normal execution (for
>>>>>>>>>>> all reasons that we are aware of), and hence when errors happen
>>>>>>>>>>> users should be alerted by metrics and look into the log files;
>>>>>>>>>>> so to me, if it is really spamming the log files, that is also a
>>>>>>>>>>> good alert for users. Besides, for deserialization errors we
>>>>>>>>>>> already log at WARN level for this reason.
>>>>>>>>>>>
>>>>>>>>>>> Regarding the metrics-levels: I was pondering on that as well.
>>>>>>>>>>> What made me think about, and agree on, task-level rather than
>>>>>>>>>>> thread-level is that for some reasons, like window retention,
>>>>>>>>>>> skips may be happening on only a subset of input partitions;
>>>>>>>>>>> since tasks are correlated with partitions, the task-level
>>>>>>>>>>> metrics can help users narrow down the specific input data
>>>>>>>>>>> partitions.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Guozhang
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Apr 2, 2018 at 6:43 PM, John Roesler <john@confluent.io
>>>>>
>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Matthias,
>>>>>>>>>>>>
>>>>>>>>>>>> No worries! Thanks for the reply.
>>>>>>>>>>>>
>>>>>>>>>>>> 1) There isn't a connection. I tried using the
>>>> TopologyTestDriver
>>>>>> to
>>>>>>>>>> write
>>>>>>>>>>>> a quick test exercising the current behavior and discovered
>>>> that
>>>>>> the
>>>>>>>>>>>> metrics weren't available. It seemed like they should be, so I
>>>>>> tacked
>>>>>>>>>> it on
>>>>>>>>>>>> to this KIP. If you feel it's inappropriate, I can pull it back
>>>>>> out.
>>>>>>>>>>>>
>>>>>>>>>>>> 2) I was also concerned about that, but I figured it would come
>>>>> up
>>>>>> in
>>>>>>>>>>>> discussion if I just went ahead and proposed it. And here we
>>>> are!
>>>>>>>>>>>>
>>>>>>>>>>>> Here's my thought: maybe there are two classes of skips:
>>>>>> "controlled"
>>>>>>>>>> and
>>>>>>>>>>>> "uncontrolled", where "controlled" means, as an app author, I
>>>>>>>>>> deliberately
>>>>>>>>>>>> filter out some events, and "uncontrolled" means that I simply
>>>>>> don't
>>>>>>>>>>>> account for some feature of the data, and the framework skips
>>>>> them
>>>>>>> (as
>>>>>>>>>>>> opposed to crashing).
>>>>>>>>>>>>
>>>>>>>>>>>> In this breakdowns, the skips I'm adding metrics for are all
>>>>>>>>>> uncontrolled
>>>>>>>>>>>> skips (and we hope to measure all the uncontrolled skips). Our
>>>>>> skips
>>>>>>>>> are
>>>>>>>>>>>> well documented, so it wouldn't be terrible to have an
>>>>> application
>>>>>> in
>>>>>>>>>> which
>>>>>>>>>>>> you know you expect to have tons of uncontrolled skips, but
>>>> it's
>>>>>> not
>>>>>>>>>> great
>>>>>>>>>>>> either, since you may also have some *unexpected* uncontrolled
>>>>>> skips.
>>>>>>>>>> It'll
>>>>>>>>>>>> be difficult to notice, since you're probably not alerting on
>>>> the
>>>>>>>>> metric
>>>>>>>>>>>> and filtering out the logs (whatever their level).
>>>>>>>>>>>>
>>>>>>>>>>>> I'd recommend any app author, as an alternative, to convert all
>>>>>>>>> expected
>>>>>>>>>>>> skips to controlled ones, by updating the topology to filter
>>>>> those
>>>>>>>>>> records
>>>>>>>>>>>> out.
>>>>>>>>>>>>
>>>>>>>>>>>> Following from my recommendation, as a library author, I'm
>>>>> inclined
>>>>>>> to
>>>>>>>>>> mark
>>>>>>>>>>>> those logs WARN, since in my opinion, they should be concerning
>>>>> to
>>>>>>> the
>>>>>>>>>> app
>>>>>>>>>>>> authors. I'd definitely want to show, rather than hide, them by
>>>>>>>>>> default, so
>>>>>>>>>>>> I would pick INFO at least.
>>>>>>>>>>>>
>>>>>>>>>>>> That said, logging is always a tricky issue for lower-level
>>>>>> libraries
>>>>>>>>>> that
>>>>>>>>>>>> run inside user code, since we don't have all the information
>>>> we
>>>>>> need
>>>>>>>>> to
>>>>>>>>>>>> make the right call.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On your last note, yeah, I got that impression from Guozhang as
>>>>>> well.
>>>>>>>>>>>> Thanks for the clarification.
>>>>>>>>>>>>
>>>>>>>>>>>> -John
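[Editor's note: a standalone illustration of John's controlled/uncontrolled distinction, in plain Java rather than the Streams DSL. Converting an expected "uncontrolled" skip, such as records with invalid timestamps, into a "controlled" one means filtering them out deliberately, so any remaining framework-level skips are genuinely unexpected. The helper name and the negative-timestamp criterion are illustrative assumptions.]

```java
import java.util.ArrayList;
import java.util.List;

public class ControlledSkipSketch {

    // Deliberate, application-level filter: drop records the app knows it
    // cannot process (here: negative timestamps), instead of letting the
    // framework skip them silently.
    static List<Long> dropInvalidTimestamps(List<Long> timestamps) {
        List<Long> valid = new ArrayList<>();
        for (long ts : timestamps) {
            if (ts >= 0) {
                valid.add(ts);
            }
        }
        return valid;
    }

    public static void main(String[] args) {
        List<Long> input = List.of(10L, -1L, 42L);
        // Only the non-negative timestamps survive the controlled filter.
        System.out.println(dropInvalidTimestamps(input)); // prints [10, 42]
    }
}
```

After such a filter is in place, a non-zero skipped-records metric is always worth investigating, which is exactly the alerting posture the thread recommends.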
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Apr 2, 2018 at 4:03 PM, Matthias J. Sax <
>>>>>>>>> matthias@confluent.io>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> John,
>>>>>>>>>>>>>
>>>>>>>>>>>>> sorry for my late reply and thanks for updating the KIP.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I like your approach about "metrics are for monitoring, logs
>>>> are
>>>>>> for
>>>>>>>>>>>>> debugging" -- however:
>>>>>>>>>>>>>
>>>>>>>>>>>>> 1) I don't see a connection between this and the task-level
>>>>>> metrics
>>>>>>>>>> that
>>>>>>>>>>>>> you propose to get the metrics in `TopologyTestDriver`. I
>>>> don't
>>>>>>> think
>>>>>>>>>>>>> people would monitor the `TopologyTestDriver` and thus I am
>>>>>>>>>>>>> wondering why it
>>>>>>>>>>>>> is important to include the metrics there? Thread-level metric
>>>>>> might
>>>>>>>>> be
>>>>>>>>>>>>> easier to monitor though (ie, less different metric to
>>>> monitor).
>>>>>>>>>>>>>
>>>>>>>>>>>>> 2) I am a little worried about WARN level logging and that it
>>>>>> might
>>>>>>>>> be
>>>>>>>>>>>>> too chatty -- as you pointed out, it's about debugging, thus
>>>>> DEBUG
>>>>>>>>>> level
>>>>>>>>>>>>> might be better. Not 100% sure about this to be honest. What
>>>> is
>>>>>> the
>>>>>>>>>>>>> general assumption about the frequency for skipped records? I
>>>>>> could
>>>>>>>>>>>>> imagine cases for which skipped records are quite frequent and
>>>>>> thus,
>>>>>>>>>>>>> WARN level logs might "flood" the logs
>>>>>>>>>>>>>
>>>>>>>>>>>>> One final remark:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> More
>>>>>>>>>>>>>> generally, I would like to establish a pattern in which we
>>>>> could
>>>>>>> add
>>>>>>>>>>>> new
>>>>>>>>>>>>>> values for the "reason" tags without needing a KIP to do so.
>>>>>>>>>>>>>
>>>>>>>>>>>>> From my understanding, this is not feasible. Changing metrics
>>>> is
>>>>>>>>> always
>>>>>>>>>>>>> considered a public API change, and we need a KIP for any
>>>>> change.
>>>>>> As
>>>>>>>>> we
>>>>>>>>>>>>> moved away from tagging, it doesn't matter for the KIP anymore
>>>>> --
>>>>>>>>> just
>>>>>>>>>>>>> wanted to point it out.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> -Matthias
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 3/30/18 2:47 PM, John Roesler wrote:
>>>>>>>>>>>>>> Allrighty! The KIP is updated.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks again, all, for the feedback.
>>>>>>>>>>>>>> -John
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Fri, Mar 30, 2018 at 3:35 PM, John Roesler <
>>>>> john@confluent.io
>>>>>>>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hey Guozhang and Bill,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Ok, I'll update the KIP. At the risk of disturbing
>>>> consensus,
>>>>>> I'd
>>>>>>>>>> like
>>>>>>>>>>>>> to
>>>>>>>>>>>>>>> put it in the task instead of the thread so that it'll show
>>>> up
>>>>>> in
>>>>>>>>> the
>>>>>>>>>>>>>>> TopologyTestDriver metrics as well.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I'm leaning toward keeping the scope where it is right now,
>>>>> but
>>>>>> if
>>>>>>>>>>>>> others
>>>>>>>>>>>>>>> want to advocate for tossing in some more metrics, we can go
>>>>>> that
>>>>>>>>>>>> route.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks all,
>>>>>>>>>>>>>>> -John
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Fri, Mar 30, 2018 at 2:37 PM, Bill Bejeck <
>>>>> bbejeck@gmail.com
>>>>>>>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks for the KIP John, and sorry for the late comments.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I'm on the fence about providing single-level metrics, but I
>>>>>>>>>>>>>>>> think we'll have that discussion outside of this KIP.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> * maintain one skipped-record metric (could be per-thread,
>>>>>>>>>> per-task,
>>>>>>>>>>>>> or
>>>>>>>>>>>>>>>>> per-processor-node) with no "reason"
>>>>>>>>>>>>>>>>> * introduce a warn-level log detailing the
>>>>>>> topic/partition/offset
>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>> reason of the skipped record
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I'm +1 on both of these suggestions.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Finally, we have had requests in the past for some metrics
>>>>>> around
>>>>>>>>>>>> when
>>>>>>>>>>>>>>>> persistent store removes an expired window.  Would adding
>>>>> that
>>>>>> to
>>>>>>>>>> our
>>>>>>>>>>>>>>>> metrics stretch the scope of this KIP too much?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks again and overall I'm +1 on this KIP
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Bill
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Fri, Mar 30, 2018 at 2:00 PM, Guozhang Wang <
>>>>>>>>> wangguoz@gmail.com>
>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The proposal sounds good to me. About "maintain only one
>>>>> level
>>>>>>> of
>>>>>>>>>>>>>>>> metrics"
>>>>>>>>>>>>>>>>> maybe we can discuss that separately from this KIP since that would
>>>>>>>>>>>>>>>>> be a larger scope of discussion. I agree that if we are
>>>>> going
>>>>>> to
>>>>>>>>>>>>>>>> maintain
>>>>>>>>>>>>>>>>> only one-level metrics it should be lowest level and we
>>>>> would
>>>>>>> let
>>>>>>>>>>>>> users
>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>> do the roll-ups themselves, but I'm still not fully
>>>>> convinced
>>>>>>>>> that
>>>>>>>>>>>> we
>>>>>>>>>>>>>>>>> should just provide single-level metrics, because 1) I
>>>> think
>>>>>> for
>>>>>>>>>>>>>>>> different
>>>>>>>>>>>>>>>>> metrics people may be interested to investigate into
>>>>> different
>>>>>>>>>>>>>>>>> granularities, e.g. for poll / commit rate these are at
>>>> the
>>>>>>>>> lowest
>>>>>>>>>>>>>>>>> task-level metrics, while for process-rate / skip-rate
>>>> they
>>>>>> can
>>>>>>>>> be
>>>>>>>>>>>> as
>>>>>>>>>>>>>>>> low
>>>>>>>>>>>>>>>>> as processor-node metrics, and 2) user-side rolling ups
>>>> may
>>>>>> not
>>>>>>>>> be
>>>>>>>>>>>>> very
>>>>>>>>>>>>>>>>> straight-forward. But for 2) if someone can provide an
>>>>>> efficient
>>>>>>>>>> and
>>>>>>>>>>>>>>>> easy
>>>>>>>>>>>>>>>>> implementation of that I can be persuaded :)
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> For now I'm thinking we can add the metric on
>>>> thread-level,
>>>>>>>>> either
>>>>>>>>>>>>> with
>>>>>>>>>>>>>>>>> finer grained ones with "reason" tag plus an aggregated
>>>> one
>>>>>>>>> without
>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>> tag, or just having a single aggregated metric without the
>>>>> tag
>>>>>>>>>> looks
>>>>>>>>>>>>>>>> good
>>>>>>>>>>>>>>>>> to me.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Guozhang
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Fri, Mar 30, 2018 at 8:05 AM, John Roesler <
>>>>>>> john@confluent.io
>>>>>>>>>>
>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hey Guozhang,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks for the reply. Regarding JMX, I can dig it. I'll
>>>>>> provide
>>>>>>>>> a
>>>>>>>>>>>>>>>> list in
>>>>>>>>>>>>>>>>>> the KIP. I was also thinking we'd better start a
>>>>>> documentation
>>>>>>>>>> page
>>>>>>>>>>>>>>>> with
>>>>>>>>>>>>>>>>>> the metrics listed.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I'd have no problem logging a warning when we skip
>>>> records.
>>>>>> On
>>>>>>>>> the
>>>>>>>>>>>>>>>> metric
>>>>>>>>>>>>>>>>>> front, really I'm just pushing for us to maintain only
>>>> one
>>>>>>> level
>>>>>>>>>> of
>>>>>>>>>>>>>>>>>> metrics. If that's more or less granular (i.e., maybe we
>>>>>> don't
>>>>>>>>>>>> have a
>>>>>>>>>>>>>>>>>> metric per reason and log the reason instead), that's
>>>> fine
>>>>> by
>>>>>>>>> me.
>>>>>>>>>> I
>>>>>>>>>>>>>>>> just
>>>>>>>>>>>>>>>>>> don't think it provides a lot of extra value per
>>>> complexity
>>>>>>>>>>>>> (interface
>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>> implementation) to maintain roll-ups at the thread level
>>>> in
>>>>>>>>>>>> addition
>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>> lower-level metrics.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> How about this instead:
>>>>>>>>>>>>>>>>>> * maintain one skipped-record metric (could be
>>>> per-thread,
>>>>>>>>>>>> per-task,
>>>>>>>>>>>>>>>> or
>>>>>>>>>>>>>>>>>> per-processor-node) with no "reason"
>>>>>>>>>>>>>>>>>> * introduce a warn-level log detailing the
>>>>>>>>> topic/partition/offset
>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>> reason of the skipped record
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> If you like that, I can update the KIP.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>> -John
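[Editor's note: a rough sketch of the operator-side alerting the thread describes ("set alerts on all skipped-records metrics as non-zero"), in standalone Java. The metric-name strings are illustrative, not the final KIP names; in a real deployment the input map would come from KafkaStreams#metrics() or a JMX scrape.]

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SkipAlertSketch {

    // Returns the subset of metrics that should trigger an alert: any
    // skipped-records metric whose value is non-zero.
    static Map<String, Double> nonZeroSkips(Map<String, Double> metrics) {
        Map<String, Double> alerts = new LinkedHashMap<>();
        for (Map.Entry<String, Double> e : metrics.entrySet()) {
            if (e.getKey().contains("skipped-records") && e.getValue() > 0.0) {
                alerts.put(e.getKey(), e.getValue());
            }
        }
        return alerts;
    }

    public static void main(String[] args) {
        Map<String, Double> metrics = new LinkedHashMap<>();
        metrics.put("skipped-records-total{client-id=app-1}", 0.0);
        metrics.put("skipped-records-total{client-id=app-2}", 3.0);
        metrics.put("process-rate{client-id=app-1}", 120.0);
        System.out.println(nonZeroSkips(metrics)); // only the app-2 skip metric
    }
}
```

Note this check works the same whether the metric is recorded per-thread, per-task, or per-processor-node, which supports the point that most alerting systems handle tagged metrics equally well.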
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Thu, Mar 29, 2018 at 6:22 PM, Guozhang Wang <
>>>>>>>>>> wangguoz@gmail.com
>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> One thing you mention is the notion of setting alerts
>>>> on
>>>>>>>>> coarser
>>>>>>>>>>>>>>>>>> metrics
>>>>>>>>>>>>>>>>>>> being easier than finer ones. All the metric alerting
>>>>>> systems
>>>>>>> I
>>>>>>>>>>>> have
>>>>>>>>>>>>>>>>> used
>>>>>>>>>>>>>>>>>>> make it equally easy to alert on metrics by-tag or over
>>>>>> tags.
>>>>>>>>> So
>>>>>>>>>>>> my
>>>>>>>>>>>>>>>>>>> experience doesn't say that this is a use case. Were you
>>>>>>>>> thinking
>>>>>>>>>>>>>>>> of an
>>>>>>>>>>>>>>>>>>> alerting system that makes such a pre-aggregation
>>>>> valuable?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> For the commonly used JMX reporter tags will be encoded
>>>>>>>>> directly
>>>>>>>>>>>> as
>>>>>>>>>>>>>>>>> part
>>>>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>>> the object name, and if users wants to monitor them they
>>>>>> need
>>>>>>>>> to
>>>>>>>>>>>>>>>> know
>>>>>>>>>>>>>>>>>> these
>>>>>>>>>>>>>>>>>>> values before hand. That is also why I think we do want
>>>> to
>>>>>>> list
>>>>>>>>>>>> all
>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>> possible values of the reason tags in the KIP, since
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> In my email in response to Matthias, I gave an example
>>>> of
>>>>>> the
>>>>>>>>>>>>>>>> kind of
>>>>>>>>>>>>>>>>>>> scenario that would lead me as an operator to run with
>>>>> DEBUG
>>>>>>> on
>>>>>>>>>>>> all
>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>> time, since I wouldn't be sure, having seen a skipped
>>>>> record
>>>>>>>>>> once,
>>>>>>>>>>>>>>>> that
>>>>>>>>>>>>>>>>>> it
>>>>>>>>>>>>>>>>>>> would ever happen again. The solution is to capture all
>>>>> the
>>>>>>>>>>>>>>>> available
>>>>>>>>>>>>>>>>>>> information about the reason and location of skips all
>>>> the
>>>>>>>>> time.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> That is a good point. I think we can either expose all
>>>>>> levels
>>>>>>>>>>>>>>>> metrics
>>>>>>>>>>>>>>>>> as
>>>>>>>>>>>>>>>>>> by
>>>>>>>>>>>>>>>>>>> default, or only expose the most lower-level metrics and
>>>>> get
>>>>>>>>> rid
>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>> other
>>>>>>>>>>>>>>>>>>> levels to let users do roll-ups themselves (which will
>>>> be
>>>>> a
>>>>>>>>> much
>>>>>>>>>>>>>>>> larger
>>>>>>>>>>>>>>>>>>> scope for discussion), or we can encourage users to not
>>>>>> purely
>>>>>>>>>>>>>>>> depend
>>>>>>>>>>>>>>>>> on
>>>>>>>>>>>>>>>>>>> metrics for such trouble shooting: that is to say, users
>>>>>> only
>>>>>>>>> be
>>>>>>>>>>>>>>>>> alerted
>>>>>>>>>>>>>>>>>>> based on metrics, and we can log a info / warn log4j
>>>> entry
>>>>>>> each
>>>>>>>>>>>>>>>> time we
>>>>>>>>>>>>>>>>>> are
>>>>>>>>>>>>>>>>>>> about to skip a record all over the places, so that upon
>>>>>> being
>>>>>>>>>>>>>>>> notified
>>>>>>>>>>>>>>>>>>> users can look into the logs to find the details on
>>>> where
>>>>> /
>>>>>>>>> when
>>>>>>>>>>>> it
>>>>>>>>>>>>>>>>>>> happens. WDYT?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Guozhang
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Thu, Mar 29, 2018 at 3:57 PM, John Roesler <
>>>>>>>>> john@confluent.io
>>>>>>>>>>>
>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Hey Guozhang,
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Thanks for the review.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> 1.
>>>>>>>>>>>>>>>>>>>> Matthias raised the same question about the "reason"
>>>> tag
>>>>>>>>> values.
>>>>>>>>>>>> I
>>>>>>>>>>>>>>>>> can
>>>>>>>>>>>>>>>>>>> list
>>>>>>>>>>>>>>>>>>>> all possible values of the "reason" tag, but I'm
>>>> thinking
>>>>>>> this
>>>>>>>>>>>>>>>> level
>>>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>>>> detail may not be KIP-worthy, maybe the code and
>>>>>>> documentation
>>>>>>>>>>>>>>>> review
>>>>>>>>>>>>>>>>>>> would
>>>>>>>>>>>>>>>>>>>> be sufficient. If you all disagree and would like it
>>>>>> included
>>>>>>>>> in
>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>> KIP, I
>>>>>>>>>>>>>>>>>>>> can certainly do that.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> If we do provide roll-up metrics, I agree with the
>>>>> pattern
>>>>>> of
>>>>>>>>>>>>>>>> keeping
>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>> same name but eliminating the tags for the dimensions
>>>>> that
>>>>>>>>> were
>>>>>>>>>>>>>>>>>>> rolled-up.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> 2.
>>>>>>>>>>>>>>>>>>>> I'm not too sure that implementation efficiency really
>>>>>>>>> becomes a
>>>>>>>>>>>>>>>>> factor
>>>>>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>>>> choosing whether to (by default) update one coarse
>>>> metric
>>>>>> at
>>>>>>>>> the
>>>>>>>>>>>>>>>>> thread
>>>>>>>>>>>>>>>>>>>> level or one granular metric at the processor-node
>>>> level,
>>>>>>>>> since
>>>>>>>>>>>>>>>> it's
>>>>>>>>>>>>>>>>>> just
>>>>>>>>>>>>>>>>>>>> one metric being updated either way. I do agree that if
>>>>> we
>>>>>>>>> were
>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>> update
>>>>>>>>>>>>>>>>>>>> the granular metrics and multiple roll-ups, then we
>>>>> should
>>>>>>>>>>>>>>>> consider
>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>> efficiency.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I agree it's probably not necessary to surface the
>>>>> metrics
>>>>>>> for
>>>>>>>>>>>> all
>>>>>>>>>>>>>>>>>> nodes
>>>>>>>>>>>>>>>>>>>> regardless of whether they can or do skip records.
>>>>> Perhaps
>>>>>> we
>>>>>>>>>> can
>>>>>>>>>>>>>>>>>> lazily
>>>>>>>>>>>>>>>>>>>> register the metrics.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> In my email in response to Matthias, I gave an example
>>>> of
>>>>>> the
>>>>>>>>>>>>>>>> kind of
>>>>>>>>>>>>>>>>>>>> scenario that would lead me as an operator to run with
>>>>>> DEBUG
>>>>>>>>> on
>>>>>>>>>>>>>>>> all
>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>> time, since I wouldn't be sure, having seen a skipped
>>>>>> record
>>>>>>>>>>>> once,
>>>>>>>>>>>>>>>>> that
>>>>>>>>>>>>>>>>>>> it
>>>>>>>>>>>>>>>>>>>> would ever happen again. The solution is to capture all
>>>>> the
>>>>>>>>>>>>>>>> available
>>>>>>>>>>>>>>>>>>>> information about the reason and location of skips all
>>>>> the
>>>>>>>>> time.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> One thing you mention is the notion of setting alerts
>>>> on
>>>>>>>>> coarser
>>>>>>>>>>>>>>>>>> metrics
>>>>>>>>>>>>>>>>>>>> being easier than finer ones. All the metric alerting
>>>>>> systems
>>>>>>>>> I
>>>>>>>>>>>>>>>> have
>>>>>>>>>>>>>>>>>> used
>>>>>>>>>>>>>>>>>>>> make it equally easy to alert on metrics by-tag or over
>>>>>> tags.
>>>>>>>>> So
>>>>>>>>>>>>>>>> my
>>>>>>>>>>>>>>>>>>>> experience doesn't say that this is a use case. Were
>>>> you
>>>>>>>>>> thinking
>>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>> an
>>>>>>>>>>>>>>>>>>>> alerting system that makes such a pre-aggregation
>>>>> valuable?
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Thanks again,
>>>>>>>>>>>>>>>>>>>> -John
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Thu, Mar 29, 2018 at 5:24 PM, Guozhang Wang <
>>>>>>>>>>>>>>>> wangguoz@gmail.com>
>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Hello John,
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Thanks for the KIP. Some comments:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> 1. Could you list all the possible values of the
>>>>> "reason"
>>>>>>>>> tag?
>>>>>>>>>>>>>>>> In
>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>> JIRA
>>>>>>>>>>>>>>>>>>>>> ticket I left some potential reasons but I'm not clear
>>>>> if
>>>>>>>>>> you're
>>>>>>>>>>>>>>>>>> going
>>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>> categorize each of them as a separate reason, or is
>>>>> there
>>>>>>> any
>>>>>>>>>>>>>>>>>>> additional
>>>>>>>>>>>>>>>>>>>>> ones you have in mind.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Also I'm wondering if we should add another metric
>>>> that
>>>>> do
>>>>>>>>> not
>>>>>>>>>>>>>>>> have
>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>> reason tag but aggregates among all possible reasons?
>>>>> This
>>>>>>> is
>>>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>>> users
>>>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>> easily set their alerting notifications (otherwise
>>>> they
>>>>>> have
>>>>>>>>> to
>>>>>>>>>>>>>>>>> write
>>>>>>>>>>>>>>>>>>> on
>>>>>>>>>>>>>>>>>>>>> notification rule per reason) in their monitoring
>>>>> systems.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> 2. Note that the processor-node metrics is actually
>>>>>>>>>> "per-thread,
>>>>>>>>>>>>>>>>>>>> per-task,
>>>>>>>>>>>>>>>>>>>>> per-processor-node", and today we only set the
>>>>> per-thread
>>>>>>>>>>>>>>>> metrics
>>>>>>>>>>>>>>>>> as
>>>>>>>>>>>>>>>>>>> INFO
>>>>>>>>>>>>>>>>>>>>> while leaving the lower two layers as DEBUG. I agree
>>>>> with
>>>>>>>>> your
>>>>>>>>>>>>>>>>>> argument
>>>>>>>>>>>>>>>>>>>>> that we are missing the per-client roll-up metrics
>>>>> today,
>>>>>>> but
>>>>>>>>>>>>>>>> I'm
>>>>>>>>>>>>>>>>>>>> convinced
>>>>>>>>>>>>>>>>>>>>> that the right way to approach it would be
>>>>>>>>>>>>>>>>>> "just-providing-the-lowest-
>>>>>>>>>>>>>>>>>>>>> level
>>>>>>>>>>>>>>>>>>>>> metrics only".
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Note the recording implementations of these three levels
>>>>> are
>>>>>>>>>>>>>>>>> different
>>>>>>>>>>>>>>>>>>>>> internally today: we did not just do the rolling up to
>>>>>>>>> generate
>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>> higher-level metrics from the lower level ones, but we
>>>>>> just
>>>>>>>>>>>>>>>> record
>>>>>>>>>>>>>>>>>> them
>>>>>>>>>>>>>>>>>>>>> separately, which means that, if we turn on multiple
>>>>>> levels
>>>>>>>>> of
>>>>>>>>>>>>>>>>>> metrics,
>>>>>>>>>>>>>>>>>>>> we
>>>>>>>>>>>>>>>>>>>>> maybe duplicate collecting some metrics. One can argue
>>>>>> that
>>>>>>>>> is
>>>>>>>>>>>>>>>> not
>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>> best
>>>>>>>>>>>>>>>>>>>>> way to represent multi-level metrics collecting and
>>>>>>>>> reporting,
>>>>>>>>>>>>>>>> but
>>>>>>>>>>>>>>>>> by
>>>>>>>>>>>>>>>>>>>> only
>>>>>>>>>>>>>>>>>>>>> enabling thread-level metrics as INFO today, that
>>>>>>>>>> implementation
>>>>>>>>>>>>>>>>>> could
>>>>>>>>>>>>>>>>>>> be
>>>>>>>>>>>>>>>>>>>>> more efficient than only collecting the metrics at the
>>>>>>> lowest
>>>>>>>>>>>>>>>>> level,
>>>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>>> then do the roll-up calculations outside of the
>>>> metrics
>>>>>>>>>> classes.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Plus, today not all processor-nodes may possibly skip
>>>>>>>>> records,
>>>>>>>>>>>>>>>>> AFAIK
>>>>>>>>>>>>>>>>>> we
>>>>>>>>>>>>>>>>>>>>> will only skip records at the source, sink, window and
>>>>>>>>>>>>>>>> aggregation
>>>>>>>>>>>>>>>>>>>>> processor nodes, so adding a metric per processor
>>>> looks
>>>>>> like
>>>>>>>>> an
>>>>>>>>>>>>>>>>>>> overkill
>>>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>> me as well. On the other hand, from user's perspective
>>>>> the
>>>>>>>>>>>>>>>> "reason"
>>>>>>>>>>>>>>>>>> tag
>>>>>>>>>>>>>>>>>>>> may
>>>>>>>>>>>>>>>>>>>>> be sufficient for them to narrow down where inside the
>>>>>>>>> topology
>>>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>>>> causing
>>>>>>>>>>>>>>>>>>>>> records to be dropped on the floor. So I think the
>>>>>>>>> "per-thread,
>>>>>>>>>>>>>>>>>>> per-task"
>>>>>>>>>>>>>>>>>>>>> level metrics should be sufficient for them in trouble
>>>>>> shoot
>>>>>>>>> in
>>>>>>>>>>>>>>>>> DEBUG
>>>>>>>>>>>>>>>>>>>> mode,
>>>>>>>>>>>>>>>>>>>>> and we can add another "per-thread" level metrics as
>>>>> INFO
>>>>>>>>> which
>>>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>>> turned
>>>>>>>>>>>>>>>>>>>>> on by default. So under normal execution users still
>>>>> only
>>>>>>>>> need
>>>>>>>>>>>>>>>> INFO
>>>>>>>>>>>>>>>>>>> level
>>>>>>>>>>>>>>>>>>>>> metrics for alerting (e.g. set alerts on all
>>>>>> skipped-records
>>>>>>>>>>>>>>>>> metrics
>>>>>>>>>>>>>>>>>> as
>>>>>>>>>>>>>>>>>>>>> non-zero), and then upon trouble shooting they can
>>>> turn
>>>>> on
>>>>>>>>>> DEBUG
>>>>>>>>>>>>>>>>>>> metrics
>>>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>> look into which task is actually causing the skipped
>>>>>>> records.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Guozhang
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Thu, Mar 29, 2018 at 2:03 PM, Matthias J. Sax <
>>>>>>>>>>>>>>>>>>> matthias@confluent.io>
>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Thanks for the KIP John.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Reading the material on the related Jiras, I am
>>>>> wondering
>>>>>>>>> what
>>>>>>>>>>>>>>>>>>> `reason`
>>>>>>>>>>>>>>>>>>>>>> tags you want to introduce? Can you elaborate? The
>>>> KIP
>>>>>>>>> should
>>>>>>>>>>>>>>>>> list
>>>>>>>>>>>>>>>>>>>> those
>>>>>>>>>>>>>>>>>>>>>> IMHO.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> About the fine grained metrics vs the roll-up: you
>>>> say
>>>>>> that
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> the coarse metric aggregates across two dimensions
>>>>>>>>>>>>>>>>> simultaneously
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Can you elaborate why this is an issue? I am not
>>>>>> convinced
>>>>>>>>> atm
>>>>>>>>>>>>>>>>> that
>>>>>>>>>>>>>>>>>>> we
>>>>>>>>>>>>>>>>>>>>>> should put the fine grained metrics into INFO level
>>>> and
>>>>>>>>> remove
>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>> roll-up at thread level.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Given that they have to do this sum to get a usable
>>>>>>>>>>>>>>>> top-level
>>>>>>>>>>>>>>>>>> view
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> This is a fair concern, but I don't share the
>>>>> conclusion.
>>>>>>>>>>>>>>>>> Offering
>>>>>>>>>>>>>>>>>> a
>>>>>>>>>>>>>>>>>>>>>> built-in `KafkaStreams` "client" roll-up out of the
>>>> box
>>>>>>>>> might
>>>>>>>>>>>>>>>> be
>>>>>>>>>>>>>>>>> a
>>>>>>>>>>>>>>>>>>>>>> better solution. In the past we did not offer this
>>>> due
>>>>> to
>>>>>>>>>>>>>>>>>> performance
>>>>>>>>>>>>>>>>>>>>>> concerns, but we could allow an "opt-in" mechanism.
>>>> If
>>>>>> you
>>>>>>>>>>>>>>>>>> disagree,
>>>>>>>>>>>>>>>>>>>> can
>>>>>>>>>>>>>>>>>>>>>> you provide some reasoning and add them to the
>>>>> "Rejected
>>>>>>>>>>>>>>>>>>> alternatives"
>>>>>>>>>>>>>>>>>>>>>> section.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> To rephrase: I understand the issue about missing
>>>>>> top-level
>>>>>>>>>>>>>>>> view,
>>>>>>>>>>>>>>>>>> but
>>>>>>>>>>>>>>>>>>>>>> instead of going more fine grained, we should
>>>> consider
>>>>> to
>>>>>>>>> add
>>>>>>>>>>>>>>>>> this
>>>>>>>>>>>>>>>>>>>>>> top-level view and add/keep the fine grained metrics
>>>> at
>>>>>>>>> DEBUG
>>>>>>>>>>>>>>>>> level
>>>>>>>>>>>>>>>>>>>> only
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> I am +1 to add TopologyTestDriver#metrics() and to
>>>>> remove
>>>>>>>>> old
>>>>>>>>>>>>>>>>>> metrics
>>>>>>>>>>>>>>>>>>>>>> directly as you suggested.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> -Matthias
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On 3/28/18 6:42 PM, Ted Yu wrote:
>>>>>>>>>>>>>>>>>>>>>>> Looks good to me.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> On Wed, Mar 28, 2018 at 3:11 PM, John Roesler <
>>>>>>>>>>>>>>>>> john@confluent.io
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Hello all,
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> I am proposing KIP-274 to improve the metrics
>>>> around
>>>>>>>>>>>>>>>> skipped
>>>>>>>>>>>>>>>>>>> records
>>>>>>>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>>>>>>>> Streams.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Please find the details here:
>>>>>>>>>>>>>>>>>>>>>>>> https://cwiki.apache.org/
>>>>> confluence/display/KAFKA/KIP-
>>>>>>>>>>>>>>>>>>>>>>>> 274%3A+Kafka+Streams+Skipped+Records+Metrics
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Please let me know what you think!
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>>>> -John
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>>>>> -- Guozhang
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>>> -- Guozhang
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>> -- Guozhang
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> -- Guozhang
>>>>>>
>>>>>
>>>>
>>>
>>>
>>>
>>
>>
> 


Re: [DISCUSS] KIP-274: Kafka Streams Skipped Records Metrics

Posted by John Roesler <jo...@confluent.io>.
Cool, if you're satisfied with the KIP now, maybe I can lobby for your vote
;)

The vote thread is still at only one binding +1, I think.

Thanks,
-John
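The roll-up users would otherwise have to do themselves — summing a fine-grained skipped-records metric over its tag dimensions — is mechanically just a grouped sum. A minimal illustrative sketch (the task IDs, reason names, and counts are made up for the example; this is not the Kafka Streams metrics API):

```python
from collections import Counter

# Hypothetical fine-grained skip counts, keyed by (task-id, reason) tags.
fine_grained = {
    ("0_0", "deserialization-error"): 3,
    ("0_1", "deserialization-error"): 1,
    ("0_1", "negative-timestamp"): 2,
}

def roll_up(metrics, keep):
    """Sum metric values, keeping only the tag positions listed in `keep`."""
    rolled = Counter()
    for tags, value in metrics.items():
        rolled[tuple(tags[i] for i in keep)] += value
    return dict(rolled)

per_task = roll_up(fine_grained, keep=(0,))    # aggregate away the reason tag
per_reason = roll_up(fine_grained, keep=(1,))  # aggregate away the task tag
total_skipped = sum(fine_grained.values())     # single top-level number
```

Whether this sum happens inside the library or in the user's monitoring system is exactly the trade-off being debated in the thread.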

On Tue, Apr 3, 2018 at 9:05 PM, Matthias J. Sax <ma...@confluent.io>
wrote:

> Sounds great!
>
> The cryptic topic names can be an issue -- however, people can
> `describe()` their topology to map each name to the corresponding
> sub-topology/tasks and narrow the error down to the relevant
> operators. I think this should be "sufficient for now" for debugging.
>
> Renaming those topics seems to be out of scope for this KIP.
>
>
> -Matthias
>
> On 4/3/18 2:45 PM, Guozhang Wang wrote:
> > Thanks John, your proposal looks fine to me.
> >
> > I'll go ahead and look into the PR for more details myself.
> >
> >
> > Guozhang
> >
> > On Tue, Apr 3, 2018 at 1:35 PM, Bill Bejeck <bb...@gmail.com> wrote:
> >
> >> Hi John,
> >>
> >> Thanks for making the updates.
> >>
> >> I agree with the information you've included in the logs as described
> >> above, as log statements without enough context/information can be
> >> frustrating.
> >>
> >> -Bill
> >>
> >> On Tue, Apr 3, 2018 at 3:29 PM, John Roesler <jo...@confluent.io> wrote:
> >>
> >>> Allrighty, how about this, then...
> >>>
> >>> I'll move the metric back to the StreamThread and maintain the existing
> >> tag
> >>> (client-id=...(per-thread client-id)). It won't be present in the
> >>> TopologyTestDriver's metrics.
> >>>
> >>> As a side note, I'm not sure that the location of the log messages has
> >>> visibility into the name of the thread or the task, or the processor
> >> node,
> >>> for that matter. But at the end of the day, I don't think it really
> >>> matters.
> >>>
> >>> None of those identifiers are in the public interface or
> user-controlled.
> >>> For them to be useful for debugging, users would have to gain a very
> deep
> >>> understanding of how their DSL program gets executed. From my
> >> perspective,
> >>> they are all included in metric tags only to prevent collisions between
> >> the
> >>> same metrics in different (e.g.) threads.
> >>>
> >>> I think what's important is to provide the right information in the
> logs
> >>> that users will be able to debug their issues. This is why the logs in
> my
> >>> pr include the topic/partition/offset of the offending data, as well as
> >> the
> >>> stacktrace of the exception from the deserializer (or for timestamps,
> the
> >>> extracted timestamp and the class name of their extractor). This
> >>> information alone should let them pinpoint the offending data and fix
> it.
> >>>
> >>> (I am aware that that topic name might be a repartition topic, and
> >>> therefore also esoteric from the user's perspective, but I think it's
> the
> >>> best we can do right now. It might be nice to explicitly take on a
> >>> debugging ergonomics task in the future and give all processor nodes
> >>> human-friendly names. Then, we could surface these names in any logs or
> >>> exceptions. But I'm inclined to call this out-of-scope for now.)
> >>>
> >>> Thanks again,
> >>> -John
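The context-rich skip warning described above — carrying the topic, partition, offset, and reason of the offending record — can be sketched as follows. The helper name and message format here are hypothetical illustrations, not the actual Kafka Streams log text:

```python
import logging

logger = logging.getLogger("skip-demo")

def warn_skipped_record(topic, partition, offset, reason, detail):
    # Carry enough context that the offending input record can be located
    # without turning on finer-grained metrics.
    msg = ("Skipping record due to %s. topic=[%s] partition=[%d] "
           "offset=[%d] detail=[%s]") % (reason, topic, partition, offset, detail)
    logger.warning(msg)
    return msg

line = warn_skipped_record("orders", 2, 1041,
                           "deserialization error", "SerializationException")
```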
> >>>
> >>> On Tue, Apr 3, 2018 at 1:40 PM, Guozhang Wang <wa...@gmail.com>
> >> wrote:
> >>>
> >>>> 1. If we can indeed gather all the context information from the log4j
> >>>> entries I'd suggest we change to thread-level (I'm not sure if that is
> >>>> doable, so if John already has some WIP PR that can help us decide).
> >>>>
> >>>> 2. We can consider adding the API in TopologyTestDriver for general
> >>> testing
> >>>> purposes; that being said, I think Matthias has a good point that this
> >>>> alone should not be a driving motivation for us to keep this metric as
> >>>> task-level if 1) is true.
> >>>>
> >>>>
> >>>>
> >>>> Guozhang
> >>>>
> >>>>
> >>>> On Tue, Apr 3, 2018 at 11:36 AM, Matthias J. Sax <
> >> matthias@confluent.io>
> >>>> wrote:
> >>>>
> >>>>> Thanks Guozhang, that was my intent.
> >>>>>
> >>>>> @John: yes, we should not nail down the exact log message. It's just
> >> to
> >>>>> point out the trade-off. If we can get the required information in
> >> the
> >>>>> logs, we might not need task level metrics.
> >>>>>
> >>>>>
> >>>>> -Matthias
> >>>>>
> >>>>> On 4/3/18 11:26 AM, Guozhang Wang wrote:
> >>>>>> I think Matthias' comment is that, we can still record the metrics
> >> on
> >>>> the
> >>>>>> thread-level, while having the WARN log entry to include sufficient
> >>>>> context
> >>>>>> information so that users can still easily narrow down the
> >>>> investigation
> >>>>>> scope.
> >>>>>>
> >>>>>>
> >>>>>> Guozhang
> >>>>>>
> >>>>>> On Tue, Apr 3, 2018 at 11:22 AM, John Roesler <jo...@confluent.io>
> >>>> wrote:
> >>>>>>
> >>>>>>> I agree we should add as much information as is reasonable to the
> >>> log.
> >>>>> For
> >>>>>>> example, see this WIP PR I started for this KIP:
> >>>>>>>
> >>>>>>> https://github.com/apache/kafka/pull/4812/files#diff-
> >>>>>>> 88d129f048bc842c7db5b2566a45fce8R80
> >>>>>>>
> >>>>>>> and
> >>>>>>>
> >>>>>>> https://github.com/apache/kafka/pull/4812/files#diff-
> >>>>>>> 69e6789eb675ec978a1abd24fed96eb1R111
> >>>>>>>
> >>>>>>> I'm not sure if we should nail down the log messages in the KIP or
> >>> in
> >>>>> the
> >>>>>>> PR discussion. What say you?
> >>>>>>>
> >>>>>>> Thanks,
> >>>>>>> -John
> >>>>>>>
> >>>>>>> On Tue, Apr 3, 2018 at 12:20 AM, Matthias J. Sax <
> >>>> matthias@confluent.io
> >>>>>>
> >>>>>>> wrote:
> >>>>>>>
> >>>>>>>> Thanks for sharing your thoughts. As I mentioned originally, I am
> >>> not
> >>>>>>>> sure about the right log level either. Your arguments are
> >>> convincing
> >>>> --
> >>>>>>>> thus, I am fine with keeping WARN level.
> >>>>>>>>
> >>>>>>>> The task vs thread level argument is an interesting one.
> >> However, I
> >>>> am
> >>>>>>>> wondering if we should add this information into the
> >> corresponding
> >>>> WARN
> >>>>>>>> logs that we write anyway? For this case, we can also log the
> >>>>>>>> corresponding operator (and other information like topic name etc
> >>> if
> >>>>>>>> needed). WDYT about this?
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> -Matthias
> >>>>>>>>
> >>>>>>>> On 4/2/18 8:31 PM, Guozhang Wang wrote:
> >>>>>>>>> Regarding logging: I'm inclined to keep logging at WARN level
> >>> since
> >>>>>>>> skipped
> >>>>>>>>> records are not expected in normal execution (for all reasons
> >> that
> >>>> we
> >>>>>>> are
> >>>>>>>>> aware of), and hence when errors happen users should be alerted
> >>> from
> >>>>>>>>> metrics and look into the log files, so to me if it is really
> >>>>>>> spamming
> >>>>>>>>> the log files it is also a good alert for users. Besides for
> >>>>>>> deserialize
> >>>>>>>>> errors we already log at WARN level for this reason.
> >>>>>>>>>
> >>>>>>>>> Regarding the metrics-levels: I was pondering on that as well.
> >>> What
> >>>>>>> made
> >>>>>>>> me
> >>>>>>>>> think and agree on task-level rather than thread-level is that for
> >>> some
> >>>>>>>> reasons
> >>>>>>>>> like window retention, they may possibly be happening on a
> >> subset
> >>> of
> >>>>>>>> input
> >>>>>>>>> partitions, and since tasks are correlated with partitions, the
> >>> task-level
> >>>>>>>> metrics
> >>>>>>>>> can help users to narrow down on the specific input data
> >>> partitions.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Guozhang
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On Mon, Apr 2, 2018 at 6:43 PM, John Roesler <john@confluent.io
> >>>
> >>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>>> Hi Matthias,
> >>>>>>>>>>
> >>>>>>>>>> No worries! Thanks for the reply.
> >>>>>>>>>>
> >>>>>>>>>> 1) There isn't a connection. I tried using the
> >> TopologyTestDriver
> >>>> to
> >>>>>>>> write
> >>>>>>>>>> a quick test exercising the current behavior and discovered
> >> that
> >>>> the
> >>>>>>>>>> metrics weren't available. It seemed like they should be, so I
> >>>> tacked
> >>>>>>>> it on
> >>>>>>>>>> to this KIP. If you feel it's inappropriate, I can pull it back
> >>>> out.
> >>>>>>>>>>
> >>>>>>>>>> 2) I was also concerned about that, but I figured it would come
> >>> up
> >>>> in
> >>>>>>>>>> discussion if I just went ahead and proposed it. And here we
> >> are!
> >>>>>>>>>>
> >>>>>>>>>> Here's my thought: maybe there are two classes of skips:
> >>>> "controlled"
> >>>>>>>> and
> >>>>>>>>>> "uncontrolled", where "controlled" means, as an app author, I
> >>>>>>>> deliberately
> >>>>>>>>>> filter out some events, and "uncontrolled" means that I simply
> >>>> don't
> >>>>>>>>>> account for some feature of the data, and the framework skips
> >>> them
> >>>>> (as
> >>>>>>>>>> opposed to crashing).
> >>>>>>>>>>
> >>>>>>>>>> In this breakdowns, the skips I'm adding metrics for are all
> >>>>>>>> uncontrolled
> >>>>>>>>>> skips (and we hope to measure all the uncontrolled skips). Our
> >>>> skips
> >>>>>>> are
> >>>>>>>>>> well documented, so it wouldn't be terrible to have an
> >>> application
> >>>> in
> >>>>>>>> which
> >>>>>>>>>> you know you expect to have tons of uncontrolled skips, but
> >> it's
> >>>> not
> >>>>>>>> great
> >>>>>>>>>> either, since you may also have some *unexpected* uncontrolled
> >>>> skips.
> >>>>>>>> It'll
> >>>>>>>>>> be difficult to notice, since you're probably not alerting on
> >> the
> >>>>>>> metric
> >>>>>>>>>> and filtering out the logs (whatever their level).
> >>>>>>>>>>
> >>>>>>>>>> I'd recommend any app author, as an alternative, to convert all
> >>>>>>> expected
> >>>>>>>>>> skips to controlled ones, by updating the topology to filter
> >>> those
> >>>>>>>> records
> >>>>>>>>>> out.
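The recommendation above — converting expected "uncontrolled" skips into "controlled" ones by filtering bad records upstream — amounts to validating records before they can reach the framework's skip path. A small illustrative sketch (the record shape and the validity rule are assumptions for the example, not Streams behavior):

```python
# Records with a negative timestamp stand in for input the framework
# would otherwise drop; filtering them explicitly makes the skip a
# deliberate, application-level decision.
records = [
    {"key": "a", "timestamp": 10},
    {"key": "b", "timestamp": -1},  # expected bad input
    {"key": "c", "timestamp": 25},
]

def is_valid(record):
    return record["timestamp"] >= 0

accepted = [r for r in records if is_valid(r)]
controlled_skips = len(records) - len(accepted)
```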
> >>>>>>>>>>
> >>>>>>>>>> Following from my recommendation, as a library author, I'm
> >>> inclined
> >>>>> to
> >>>>>>>> mark
> >>>>>>>>>> those logs WARN, since in my opinion, they should be concerning
> >>> to
> >>>>> the
> >>>>>>>> app
> >>>>>>>>>> authors. I'd definitely want to show, rather than hide, them by
> >>>>>>>> default, so
> >>>>>>>>>> I would pick INFO at least.
> >>>>>>>>>>
> >>>>>>>>>> That said, logging is always a tricky issue for lower-level
> >>>> libraries
> >>>>>>>> that
> >>>>>>>>>> run inside user code, since we don't have all the information
> >> we
> >>>> need
> >>>>>>> to
> >>>>>>>>>> make the right call.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On your last note, yeah, I got that impression from Guozhang as
> >>>> well.
> >>>>>>>>>> Thanks for the clarification.
> >>>>>>>>>>
> >>>>>>>>>> -John
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On Mon, Apr 2, 2018 at 4:03 PM, Matthias J. Sax <
> >>>>>>> matthias@confluent.io>
> >>>>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> John,
> >>>>>>>>>>>
> >>>>>>>>>>> sorry for my late reply and thanks for updating the KIP.
> >>>>>>>>>>>
> >>>>>>>>>>> I like your approach about "metrics are for monitoring, logs
> >> are
> >>>> for
> >>>>>>>>>>> debugging" -- however:
> >>>>>>>>>>>
> >>>>>>>>>>> 1) I don't see a connection between this and the task-level
> >>>> metrics
> >>>>>>>> that
> >>>>>>>>>>> you propose to get the metrics in `TopologyTestDriver`. I
> >> don't
> >>>>> think
> >>>>>>>>>>> people would monitor the `TopologyTestDriver` and thus
> >> wondering
> >>>> why
> >>>>>>> it
> >>>>>>>>>>> is important to include the metrics there? Thread-level metric
> >>>> might
> >>>>>>> be
> >>>>>>>>>>> easier to monitor though (i.e., fewer different metrics to
> >> monitor).
> >>>>>>>>>>>
> >>>>>>>>>>> 2) I am a little worried about WARN level logging and that it
> >>>> might
> >>>>>>> be
> >>>>>>>>>>> too chatty -- as you pointed out, it's about debugging, thus
> >>> DEBUG
> >>>>>>>> level
> >>>>>>>>>>> might be better. Not 100% sure about this to be honest. What
> >> is
> >>>> the
> >>>>>>>>>>> general assumption about the frequency for skipped records? I
> >>>> could
> >>>>>>>>>>> imagine cases for which skipped records are quite frequent and
> >>>> thus,
> >>>>>>>>>>> WARN level logs might "flood" the logs
> >>>>>>>>>>>
> >>>>>>>>>>> One final remark:
> >>>>>>>>>>>
> >>>>>>>>>>>> More
> >>>>>>>>>>>> generally, I would like to establish a pattern in which we
> >>> could
> >>>>> add
> >>>>>>>>>> new
> >>>>>>>>>>>> values for the "reason" tags without needing a KIP to do so.
> >>>>>>>>>>>
> >>>>>>>>>>> From my understanding, this is not feasible. Changing metrics
> >> is
> >>>>>>> always
> >>>>>>>>>>> considered a public API change, and we need a KIP for any
> >>> change.
> >>>> As
> >>>>>>> we
> >>>>>>>>>>> moved away from tagging, it doesn't matter for the KIP anymore
> >>> --
> >>>>>>> just
> >>>>>>>>>>> wanted to point it out.
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> -Matthias
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> On 3/30/18 2:47 PM, John Roesler wrote:
> >>>>>>>>>>>> Allrighty! The KIP is updated.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Thanks again, all, for the feedback.
> >>>>>>>>>>>> -John
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Fri, Mar 30, 2018 at 3:35 PM, John Roesler <
> >>> john@confluent.io
> >>>>>
> >>>>>>>>>> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Hey Guozhang and Bill,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Ok, I'll update the KIP. At the risk of disturbing
> >> consensus,
> >>>> I'd
> >>>>>>>> like
> >>>>>>>>>>> to
> >>>>>>>>>>>>> put it in the task instead of the thread so that it'll show
> >> up
> >>>> in
> >>>>>>> the
> >>>>>>>>>>>>> TopologyTestDriver metrics as well.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I'm leaning toward keeping the scope where it is right now,
> >>> but
> >>>> if
> >>>>>>>>>>> others
> >>>>>>>>>>>>> want to advocate for tossing in some more metrics, we can go
> >>>> that
> >>>>>>>>>> route.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Thanks all,
> >>>>>>>>>>>>> -John
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Fri, Mar 30, 2018 at 2:37 PM, Bill Bejeck <
> >>> bbejeck@gmail.com
> >>>>>
> >>>>>>>>>> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> Thanks for the KIP John, and sorry for the late comments.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> I'm on the fence with providing a single level metrics,
> >> but I
> >>>>>>> think
> >>>>>>>>>>> we'll
> >>>>>>>>>>>>>> have that discussion outside of this KIP.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> * maintain one skipped-record metric (could be per-thread,
> >>>>>>>> per-task,
> >>>>>>>>>>> or
> >>>>>>>>>>>>>>> per-processor-node) with no "reason"
> >>>>>>>>>>>>>>> * introduce a warn-level log detailing the
> >>>>> topic/partition/offset
> >>>>>>>>>> and
> >>>>>>>>>>>>>>> reason of the skipped record
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> I'm +1 on both of these suggestions.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Finally, we have had requests in the past for some metrics
> >>>> around
> >>>>>>>>>> when
> >>>>>>>>>>>>>> persistent store removes an expired window.  Would adding
> >>> that
> >>>> to
> >>>>>>>> our
> >>>>>>>>>>>>>> metrics stretch the scope of this KIP too much?
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Thanks again and overall I'm +1 on this KIP
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Bill
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Fri, Mar 30, 2018 at 2:00 PM, Guozhang Wang <
> >>>>>>> wangguoz@gmail.com>
> >>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> The proposal sounds good to me. About "maintain only one
> >>> level
> >>>>> of
> >>>>>>>>>>>>>> metrics"
> >>>>>>>>>>>>>>> maybe we can discuss about that separately from this KIP
> >>> since
> >>>>>>> that
> >>>>>>>>>>>>>> would
> >>>>>>>>>>>>>>> be a larger scope of discussion. I agree that if we are
> >>> going
> >>>> to
> >>>>>>>>>>>>>> maintain
> >>>>>>>>>>>>>>> only one-level metrics it should be lowest level and we
> >>> would
> >>>>> let
> >>>>>>>>>>> users
> >>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>> do the roll-ups themselves, but I'm still not fully
> >>> convinced
> >>>>>>> that
> >>>>>>>>>> we
> >>>>>>>>>>>>>>> should just provide single-level metrics, because 1) I
> >> think
> >>>> for
> >>>>>>>>>>>>>> different
> >>>>>>>>>>>>>>>> metrics people may be interested in investigating
> >>> different
> >>>>>>>>>>>>>>> granularities, e.g. for poll / commit rate these are at
> >> the
> >>>>>>> lowest
> >>>>>>>>>>>>>>> task-level metrics, while for process-rate / skip-rate
> >> they
> >>>> can
> >>>>>>> be
> >>>>>>>>>> as
> >>>>>>>>>>>>>> low
> >>>>>>>>>>>>>>> as processor-node metrics, and 2) user-side rolling ups
> >> may
> >>>> not
> >>>>>>> be
> >>>>>>>>>>> very
> >>>>>>>>>>>>>>> straight-forward. But for 2) if someone can provide an
> >>>> efficient
> >>>>>>>> and
> >>>>>>>>>>>>>> easy
> >>>>>>>>>>>>>>> implementation of that I can be persuaded :)
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> For now I'm thinking we can add the metric on
> >> thread-level,
> >>>>>>> either
> >>>>>>>>>>> with
> >>>>>>>>>>>>>>> finer grained ones with "reason" tag plus an aggregated
> >> one
> >>>>>>> without
> >>>>>>>>>>> the
> >>>>>>>>>>>>>>> tag, or just having a single aggregated metric without the
> >>> tag
> >>>>>>>> looks
> >>>>>>>>>>>>>> good
> >>>>>>>>>>>>>>> to me.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Guozhang
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On Fri, Mar 30, 2018 at 8:05 AM, John Roesler <
> >>>>> john@confluent.io
> >>>>>>>>
> >>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Hey Guozhang,
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Thanks for the reply. Regarding JMX, I can dig it. I'll
> >>>> provide
> >>>>>>> a
> >>>>>>>>>>>>>> list in
> >>>>>>>>>>>>>>>> the KIP. I was also thinking we'd better start a
> >>>> documentation
> >>>>>>>> page
> >>>>>>>>>>>>>> with
> >>>>>>>>>>>>>>>> the metrics listed.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> I'd have no problem logging a warning when we skip
> >> records.
> >>>> On
> >>>>>>> the
> >>>>>>>>>>>>>> metric
> >>>>>>>>>>>>>>>> front, really I'm just pushing for us to maintain only
> >> one
> >>>>> level
> >>>>>>>> of
> >>>>>>>>>>>>>>>> metrics. If that's more or less granular (i.e., maybe we
> >>>> don't
> >>>>>>>>>> have a
> >>>>>>>>>>>>>>>> metric per reason and log the reason instead), that's
> >> fine
> >>> by
> >>>>>>> me.
> >>>>>>>> I
> >>>>>>>>>>>>>> just
> >>>>>>>>>>>>>>>> don't think it provides a lot of extra value per
> >> complexity
> >>>>>>>>>>> (interface
> >>>>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>> implementation) to maintain roll-ups at the thread level
> >> in
> >>>>>>>>>> addition
> >>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>> lower-level metrics.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> How about this instead:
> >>>>>>>>>>>>>>>> * maintain one skipped-record metric (could be
> >> per-thread,
> >>>>>>>>>> per-task,
> >>>>>>>>>>>>>> or
> >>>>>>>>>>>>>>>> per-processor-node) with no "reason"
> >>>>>>>>>>>>>>>> * introduce a warn-level log detailing the
> >>>>>>> topic/partition/offset
> >>>>>>>>>> and
> >>>>>>>>>>>>>>>> reason of the skipped record
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> If you like that, I can update the KIP.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>>> -John
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> On Thu, Mar 29, 2018 at 6:22 PM, Guozhang Wang <
> >>>>>>>> wangguoz@gmail.com
> >>>>>>>>>>>
> >>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> One thing you mention is the notion of setting alerts
> >> on
> >>>>>>> coarser
> >>>>>>>>>>>>>>>> metrics
> >>>>>>>>>>>>>>>>> being easier than finer ones. All the metric alerting
> >>>> systems
> >>>>> I
> >>>>>>>>>> have
> >>>>>>>>>>>>>>> used
> >>>>>>>>>>>>>>>>> make it equally easy to alert on metrics by-tag or over
> >>>> tags.
> >>>>>>> So
> >>>>>>>>>> my
> >>>>>>>>>>>>>>>>> experience doesn't say that this is a use case. Were you
> >>>>>>> thinking
> >>>>>>>>>>>>>> of an
> >>>>>>>>>>>>>>>>> alerting system that makes such a pre-aggregation
> >>> valuable?
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> For the commonly used JMX reporter tags will be encoded
> >>>>>>> directly
> >>>>>>>>>> as
> >>>>>>>>>>>>>>> part
> >>>>>>>>>>>>>>>> of
> >>>>>>>>>>>>>>>>> the object name, and if users wants to monitor them they
> >>>> need
> >>>>>>> to
> >>>>>>>>>>>>>> know
> >>>>>>>>>>>>>>>> these
> >>>>>>>>>>>>>>>>> values before hand. That is also why I think we do want
> >> to
> >>>>> list
> >>>>>>>>>> all
> >>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>> possible values of the reason tags in the KIP, since
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> In my email in response to Matthias, I gave an example
> >> of
> >>>> the
> >>>>>>>>>>>>>> kind of
> >>>>>>>>>>>>>>>>> scenario that would lead me as an operator to run with
> >>> DEBUG
> >>>>> on
> >>>>>>>>>> all
> >>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>> time, since I wouldn't be sure, having seen a skipped
> >>> record
> >>>>>>>> once,
> >>>>>>>>>>>>>> that
> >>>>>>>>>>>>>>>> it
> >>>>>>>>>>>>>>>>> would ever happen again. The solution is to capture all
> >>> the
> >>>>>>>>>>>>>> available
> >>>>>>>>>>>>>>>>> information about the reason and location of skips all
> >> the
> >>>>>>> time.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> That is a good point. I think we can either expose all
> >>>> levels
> >>>>>>>>>>>>>> metrics
> >>>>>>>>>>>>>>> as
> >>>>>>>>>>>>>>>> by
> >>>>>>>>>>>>>>>>> default, or only expose the most lower-level metrics and
> >>> get
> >>>>>>> rid
> >>>>>>>>>> of
> >>>>>>>>>>>>>>> other
> >>>>>>>>>>>>>>>>> levels to let users do roll-ups themselves (which will
> >> be
> >>> a
> >>>>>>> much
> >>>>>>>>>>>>>> larger
> >>>>>>>>>>>>>>>>> scope for discussion), or we can encourage users to not
> >>>> purely
> >>>>>>>>>>>>>> depend
> >>>>>>>>>>>>>>> on
> >>>>>>>>>>>>>>>>> metrics for such trouble shooting: that is to say, users
> >>>> only
> >>>>>>> be
> >>>>>>>>>>>>>>> alerted
> >>>>>>>>>>>>>>>>> based on metrics, and we can log a info / warn log4j
> >> entry
> >>>>> each
> >>>>>>>>>>>>>> time we
> >>>>>>>>>>>>>>>> are
> >>>>>>>>>>>>>>>>> about to skip a record all over the places, so that upon
> >>>> being
> >>>>>>>>>>>>>> notified
> >>>>>>>>>>>>>>>>> users can look into the logs to find the details on
> >> where
> >>> /
> >>>>>>> when
> >>>>>>>>>> it
> >>>>>>>>>>>>>>>>> happens. WDYT?
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Guozhang
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> On Thu, Mar 29, 2018 at 3:57 PM, John Roesler <
> >>>>>>> john@confluent.io
> >>>>>>>>>
> >>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Hey Guozhang,
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Thanks for the review.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> 1.
> >>>>>>>>>>>>>>>>>> Matthias raised the same question about the "reason"
> >> tag
> >>>>>>> values.
> >>>>>>>>>> I
> >>>>>>>>>>>>>>> can
> >>>>>>>>>>>>>>>>> list
> >>>>>>>>>>>>>>>>>> all possible values of the "reason" tag, but I'm
> >> thinking
> >>>>> this
> >>>>>>>>>>>>>> level
> >>>>>>>>>>>>>>> of
> >>>>>>>>>>>>>>>>>> detail may not be KIP-worthy, maybe the code and
> >>>>> documentation
> >>>>>>>>>>>>>> review
> >>>>>>>>>>>>>>>>> would
> >>>>>>>>>>>>>>>>>> be sufficient. If you all disagree and would like it
> >>>> included
> >>>>>>> in
> >>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>> KIP, I
> >>>>>>>>>>>>>>>>>> can certainly do that.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> If we do provide roll-up metrics, I agree with the
> >>> pattern
> >>>> of
> >>>>>>>>>>>>>> keeping
> >>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>> same name but eliminating the tags for the dimensions
> >>> that
> >>>>>>> were
> >>>>>>>>>>>>>>>>> rolled-up.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> 2.
> >>>>>>>>>>>>>>>>>> I'm not too sure that implementation efficiency really
> >>>>>>> becomes a
> >>>>>>>>>>>>>>> factor
> >>>>>>>>>>>>>>>>> in
> >>>>>>>>>>>>>>>>>> choosing whether to (by default) update one coarse
> >> metric
> >>>> at
> >>>>>>> the
> >>>>>>>>>>>>>>> thread
> >>>>>>>>>>>>>>>>>> level or one granular metric at the processor-node
> >> level,
> >>>>>>> since
> >>>>>>>>>>>>>> it's
> >>>>>>>>>>>>>>>> just
> >>>>>>>>>>>>>>>>>> one metric being updated either way. I do agree that if
> >>> we
> >>>>>>> were
> >>>>>>>>>> to
> >>>>>>>>>>>>>>>> update
> >>>>>>>>>>>>>>>>>> the granular metrics and multiple roll-ups, then we
> >>> should
> >>>>>>>>>>>>>> consider
> >>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>> efficiency.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> I agree it's probably not necessary to surface the
> >>> metrics
> >>>>> for
> >>>>>>>>>> all
> >>>>>>>>>>>>>>>> nodes
> >>>>>>>>>>>>>>>>>> regardless of whether they can or do skip records.
> >>> Perhaps
> >>>> we
> >>>>>>>> can
> >>>>>>>>>>>>>>>> lazily
> >>>>>>>>>>>>>>>>>> register the metrics.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> In my email in response to Matthias, I gave an example
> >> of
> >>>> the
> >>>>>>>>>>>>>> kind of
> >>>>>>>>>>>>>>>>>> scenario that would lead me as an operator to run with
> >>>> DEBUG
> >>>>>>> on
> >>>>>>>>>>>>>> all
> >>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>> time, since I wouldn't be sure, having seen a skipped
> >>>> record
> >>>>>>>>>> once,
> >>>>>>>>>>>>>>> that
> >>>>>>>>>>>>>>>>> it
> >>>>>>>>>>>>>>>>>> would ever happen again. The solution is to capture all
> >>> the
> >>>>>>>>>>>>>> available
> >>>>>>>>>>>>>>>>>> information about the reason and location of skips all
> >>> the
> >>>>>>> time.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> One thing you mention is the notion of setting alerts
> >> on
> >>>>>>> coarser
> >>>>>>>>>>>>>>>> metrics
> >>>>>>>>>>>>>>>>>> being easier than finer ones. All the metric alerting
> >>>> systems
> >>>>>>> I
> >>>>>>>>>>>>>> have
> >>>>>>>>>>>>>>>> used
> >>>>>>>>>>>>>>>>>> make it equally easy to alert on metrics by-tag or over
> >>>> tags.
> >>>>>>> So
> >>>>>>>>>>>>>> my
> >>>>>>>>>>>>>>>>>> experience doesn't say that this is a use case. Were
> >> you
> >>>>>>>> thinking
> >>>>>>>>>>>>>> of
> >>>>>>>>>>>>>>> an
> >>>>>>>>>>>>>>>>>> alerting system that makes such a pre-aggregation
> >>> valuable?
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Thanks again,
> >>>>>>>>>>>>>>>>>> -John
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> On Thu, Mar 29, 2018 at 5:24 PM, Guozhang Wang <
> >>>>>>>>>>>>>> wangguoz@gmail.com>
> >>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Hello John,
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Thanks for the KIP. Some comments:
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> 1. Could you list all the possible values of the
> >>> "reason"
> >>>>>>> tag?
> >>>>>>>>>>>>>> In
> >>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>> JIRA
> >>>>>>>>>>>>>>>>>>> ticket I left some potential reasons but I'm not clear
> >>> if
> >>>>>>>> you're
> >>>>>>>>>>>>>>>> going
> >>>>>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>>>> categorize each of them as a separate reason, or is
> >>> there
> >>>>> any
> >>>>>>>>>>>>>>>>> additional
> >>>>>>>>>>>>>>>>>>> ones you have in mind.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Also I'm wondering if we should add another metric that
> >>>>>>>>>>>>>>>>>>> does not have the reason tag but aggregates among all
> >>>>>>>>>>>>>>>>>>> possible reasons?
> >>> This
> >>>>> is
> >>>>>>>>>>>>>> for
> >>>>>>>>>>>>>>>> users
> >>>>>>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>>>> easily set their alerting notifications (otherwise
> >> they
> >>>> have
> >>>>>>> to
> >>>>>>>>>>>>>>> write
> >>>>>>>>>>>>>>>>> on
> >>>>>>>>>>>>>>>>>>> notification rule per reason) in their monitoring
> >>> systems.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> 2. Note that the processor-node metrics is actually
> >>>>>>>> "per-thread,
> >>>>>>>>>>>>>>>>>> per-task,
> >>>>>>>>>>>>>>>>>>> per-processor-node", and today we only set the
> >>> per-thread
> >>>>>>>>>>>>>> metrics
> >>>>>>>>>>>>>>> as
> >>>>>>>>>>>>>>>>> INFO
> >>>>>>>>>>>>>>>>>>> while leaving the lower two layers as DEBUG. I agree
> >>> with
> >>>>>>> your
> >>>>>>>>>>>>>>>> argument
> >>>>>>>>>>>>>>>>>>> that we are missing the per-client roll-up metrics
> >>> today,
> >>>>> but
> >>>>>>>>>>>>>> I'm
> >>>>>>>>>>>>>>>>>> convinced
> >>>>>>>>>>>>>>>>>>> that the right way to approach it would be
> >>>>>>>>>>>>>>>> "just-providing-the-lowest-
> >>>>>>>>>>>>>>>>>>> level
> >>>>>>>>>>>>>>>>>>> metrics only".
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Note the recording implementation of these three levels
> >>> are
> >>>>>>>>>>>>>>> different
> >>>>>>>>>>>>>>>>>>> internally today: we did not just do the rolling up to
> >>>>>>> generate
> >>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>> higher-level metrics from the lower level ones, but we
> >>>> just
> >>>>>>>>>>>>>> record
> >>>>>>>>>>>>>>>> them
> >>>>>>>>>>>>>>>>>>> separately, which means that, if we turn on multiple
> >>>> levels
> >>>>>>> of
> >>>>>>>>>>>>>>>> metrics,
> >>>>>>>>>>>>>>>>>> we
> >>>>>>>>>>>>>>>>>>> maybe duplicate collecting some metrics. One can argue
> >>>> that
> >>>>>>> is
> >>>>>>>>>>>>>> not
> >>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>> best
> >>>>>>>>>>>>>>>>>>> way to represent multi-level metrics collecting and
> >>>>>>> reporting,
> >>>>>>>>>>>>>> but
> >>>>>>>>>>>>>>> by
> >>>>>>>>>>>>>>>>>> only
> >>>>>>>>>>>>>>>>>>> enabling thread-level metrics as INFO today, that
> >>>>>>>> implementation
> >>>>>>>>>>>>>>>> could
> >>>>>>>>>>>>>>>>> be
> >>>>>>>>>>>>>>>>>>> more efficient than only collecting the metrics at the
> >>>>> lowest
> >>>>>>>>>>>>>>> level,
> >>>>>>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>>>>> then do the roll-up calculations outside of the
> >> metrics
> >>>>>>>> classes.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Plus, today not all processor-nodes may possibly skip
> >>>>>>> records,
> >>>>>>>>>>>>>>> AFAIK
> >>>>>>>>>>>>>>>> we
> >>>>>>>>>>>>>>>>>>> will only skip records at the source, sink, window and
> >>>>>>>>>>>>>> aggregation
> >>>>>>>>>>>>>>>>>>> processor nodes, so adding a metric per processor
> >> looks
> >>>> like
> >>>>>>> an
> >>>>>>>>>>>>>>>>> overkill
> >>>>>>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>>>> me as well. On the other hand, from user's perspective
> >>> the
> >>>>>>>>>>>>>> "reason"
> >>>>>>>>>>>>>>>> tag
> >>>>>>>>>>>>>>>>>> may
> >>>>>>>>>>>>>>>>>>> be sufficient for them to narrow down where inside the
> >>>>>>> topology
> >>>>>>>>>>>>>> is
> >>>>>>>>>>>>>>>>>> causing
> >>>>>>>>>>>>>>>>>>> records to be dropped on the floor. So I think the
> >>>>>>> "per-thread,
> >>>>>>>>>>>>>>>>> per-task"
> >>>>>>>>>>>>>>>>>>> level metrics should be sufficient for them to
> >>>>>>>>>>>>>>>>>>> troubleshoot in DEBUG mode,
> >>>>>>>>>>>>>>>>>>> and we can add another "per-thread" level metrics as
> >>> INFO
> >>>>>>> which
> >>>>>>>>>>>>>> is
> >>>>>>>>>>>>>>>>> turned
> >>>>>>>>>>>>>>>>>>> on by default. So under normal execution users still
> >>> only
> >>>>>>> need
> >>>>>>>>>>>>>> INFO
> >>>>>>>>>>>>>>>>> level
> >>>>>>>>>>>>>>>>>>> metrics for alerting (e.g. set alerts on all
> >>>> skipped-records
> >>>>>>>>>>>>>>> metrics
> >>>>>>>>>>>>>>>> as
> >>>>>>>>>>>>>>>>>>> non-zero), and then upon trouble shooting they can
> >> turn
> >>> on
> >>>>>>>> DEBUG
> >>>>>>>>>>>>>>>>> metrics
> >>>>>>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>>>> look into which task is actually causing the skipped
> >>>>> records.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Guozhang
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> On Thu, Mar 29, 2018 at 2:03 PM, Matthias J. Sax <
> >>>>>>>>>>>>>>>>> matthias@confluent.io>
> >>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Thanks for the KIP John.
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Reading the material on the related Jiras, I am
> >>> wondering
> >>>>>>> what
> >>>>>>>>>>>>>>>>> `reason`
> >>>>>>>>>>>>>>>>>>>> tags you want to introduce? Can you elaborate? The
> >> KIP
> >>>>>>> should
> >>>>>>>>>>>>>>> list
> >>>>>>>>>>>>>>>>>> those
> >>>>>>>>>>>>>>>>>>>> IMHO.
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> About the fine grained metrics vs the roll-up: you
> >> say
> >>>> that
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> the coarse metric aggregates across two dimensions
> >>>>>>>>>>>>>>> simultaneously
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Can you elaborate why this is an issue? I am not
> >>>> convinced
> >>>>>>> atm
> >>>>>>>>>>>>>>> that
> >>>>>>>>>>>>>>>>> we
> >>>>>>>>>>>>>>>>>>>> should put the fine grained metrics into INFO level
> >> and
> >>>>>>> remove
> >>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>> roll-up at thread level.
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> Given that they have to do this sum to get a usable
> >>>>>>>>>>>>>> top-level
> >>>>>>>>>>>>>>>> view
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> This is a fair concern, but I don't share the
> >>> conclusion.
> >>>>>>>>>>>>>>> Offering
> >>>>>>>>>>>>>>>> a
> >>>>>>>>>>>>>>>>>>>> built-in `KafkaStreams` "client" roll-up out of the
> >> box
> >>>>>>> might
> >>>>>>>>>>>>>> be
> >>>>>>>>>>>>>>> a
> >>>>>>>>>>>>>>>>>>>> better solution. In the past we did not offer this
> >> due
> >>> to
> >>>>>>>>>>>>>>>> performance
> >>>>>>>>>>>>>>>>>>>> concerns, but we could allow an "opt-in" mechanism.
> >> If
> >>>> you
> >>>>>>>>>>>>>>>> disagree,
> >>>>>>>>>>>>>>>>>> can
> >>>>>>>>>>>>>>>>>>>> you provide some reasoning and add them to the
> >>> "Rejected
> >>>>>>>>>>>>>>>>> alternatives"
> >>>>>>>>>>>>>>>>>>>> section.
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> To rephrase: I understand the issue about missing
> >>>> top-level
> >>>>>>>>>>>>>> view,
> >>>>>>>>>>>>>>>> but
> >>>>>>>>>>>>>>>>>>>> instead of going more fine grained, we should
> >> consider
> >>> to
> >>>>>>> add
> >>>>>>>>>>>>>>> this
> >>>>>>>>>>>>>>>>>>>> top-level view and add/keep the fine grained metrics
> >> at
> >>>>>>> DEBUG
> >>>>>>>>>>>>>>> level
> >>>>>>>>>>>>>>>>>> only
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> I am +1 to add TopologyTestDriver#metrics() and to
> >>> remove
> >>>>>>> old
> >>>>>>>>>>>>>>>> metrics
> >>>>>>>>>>>>>>>>>>>> directly as you suggested.
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> -Matthias
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> On 3/28/18 6:42 PM, Ted Yu wrote:
> >>>>>>>>>>>>>>>>>>>>> Looks good to me.
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> On Wed, Mar 28, 2018 at 3:11 PM, John Roesler <
> >>>>>>>>>>>>>>> john@confluent.io
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> Hello all,
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> I am proposing KIP-274 to improve the metrics
> >> around
> >>>>>>>>>>>>>> skipped
> >>>>>>>>>>>>>>>>> records
> >>>>>>>>>>>>>>>>>>> in
> >>>>>>>>>>>>>>>>>>>>>> Streams.
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> Please find the details here:
> >>>>>>>>>>>>>>>>>>>>>> https://cwiki.apache.org/
> >>> confluence/display/KAFKA/KIP-
> >>>>>>>>>>>>>>>>>>>>>> 274%3A+Kafka+Streams+Skipped+Records+Metrics
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> Please let me know what you think!
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>>>>>>>>> -John
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> --
> >>>>>>>>>>>>>>>>>>> -- Guozhang
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> --
> >>>>>>>>>>>>>>>>> -- Guozhang
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> --
> >>>>>>>>>>>>>>> -- Guozhang
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>
> >>>>>
> >>>>
> >>>>
> >>>> --
> >>>> -- Guozhang
> >>>>
> >>>
> >>
> >
> >
> >
>
>
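To illustrate the JMX-reporter point raised in the thread above — tags are flattened directly into the MBean ObjectName, so monitoring systems must know every tag value in advance — here is a minimal JDK-only sketch. The metric group and tag values shown are hypothetical examples in the style Kafka Streams uses, not the KIP's final names:

```java
import javax.management.MalformedObjectNameException;
import javax.management.ObjectName;

public class ObjectNameSketch {
    // Hypothetical MBean name in the style the JMX reporter emits: metric tags
    // (client-id here; a would-be "reason" tag likewise) become ObjectName key
    // properties, which is why alerting rules must enumerate tag values up front.
    static String clientIdOf(String mbeanName) {
        try {
            return new ObjectName(mbeanName).getKeyProperty("client-id");
        } catch (MalformedObjectNameException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // Prints the client-id tag parsed back out of the flattened name.
        System.out.println(clientIdOf(
            "kafka.streams:type=stream-metrics,client-id=app-1-StreamThread-1"));
    }
}
```

This is also why adding a new tag value later is a visible change for anyone who wired alerts against the flattened names.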

Re: [DISCUSS] KIP-274: Kafka Streams Skipped Records Metrics

Posted by "Matthias J. Sax" <ma...@confluent.io>.
Sounds great!

The cryptic topic names can be an issue -- however, people can
`describe()`  their topology to map the name to the corresponding
sub-topology/tasks to narrow the error down to the corresponding
operators. I think this should be "sufficient for now" for debugging.
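That describe() workflow can be sketched as follows; the topology here is a made-up example, and the generated store/processor names in the comment are illustrative of the naming scheme, but `Topology#describe()` is the real Kafka Streams API:

```java
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;

public class DescribeSketch {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();
        // Hypothetical topology: the aggregation creates an internal store and
        // changelog topic with generated names along the lines of
        // "<app-id>-KSTREAM-AGGREGATE-STATE-STORE-0000000001-changelog".
        builder.stream("input").groupByKey().count();
        Topology topology = builder.build();
        // describe() lists sub-topologies, sources, processors, and stores,
        // which lets a user map a cryptic internal topic name back to the
        // operator that produced it.
        System.out.println(topology.describe());
    }
}
```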

Renaming those topics seems to be out of scope for this KIP.


-Matthias

On 4/3/18 2:45 PM, Guozhang Wang wrote:
> Thanks John, your proposal looks fine to me.
> 
> I'll go ahead and look into the PR for more details myself.
> 
> 
> Guozhang
> 
> On Tue, Apr 3, 2018 at 1:35 PM, Bill Bejeck <bb...@gmail.com> wrote:
> 
>> Hi John,
>>
>> Thanks for making the updates.
>>
>> I agree with the information you've included in the logs as described
>> above, as log statements without enough context/information can be
>> frustrating.
>>
>> -Bill
>>
>> On Tue, Apr 3, 2018 at 3:29 PM, John Roesler <jo...@confluent.io> wrote:
>>
>>> Allrighty, how about this, then...
>>>
>>> I'll move the metric back to the StreamThread and maintain the existing
>> tag
>>> (client-id=...(per-thread client-id)). It won't be present in the
>>> TopologyTestDriver's metrics.
>>>
>>> As a side note, I'm not sure that the location of the log messages has
>>> visibility into the name of the thread or the task, or the processor
>> node,
>>> for that matter. But at the end of the day, I don't think it really
>>> matters.
>>>
>>> None of those identifiers are in the public interface or user-controlled.
>>> For them to be useful for debugging, users would have to gain a very deep
>>> understanding of how their DSL program gets executed. From my
>> perspective,
>>> they are all included in metric tags only to prevent collisions between
>> the
>>> same metrics in different (e.g.) threads.
>>>
>>> I think what's important is to provide the right information in the logs
>>> that users will be able to debug their issues. This is why the logs in my
>>> pr include the topic/partition/offset of the offending data, as well as
>> the
>>> stacktrace of the exception from the deserializer (or for timestamps, the
>>> extracted timestamp and the class name of their extractor). This
>>> information alone should let them pinpoint the offending data and fix it.
>>>
>>> (I am aware that that topic name might be a repartition topic, and
>>> therefore also esoteric from the user's perspective, but I think it's the
>>> best we can do right now. It might be nice to explicitly take on a
>>> debugging ergonomics task in the future and give all processor nodes
>>> human-friendly names. Then, we could surface these names in any logs or
>>> exceptions. But I'm inclined to call this out-of-scope for now.)
>>>
>>> Thanks again,
>>> -John
>>>
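The kind of context such a WARN line carries (topic/partition/offset plus extractor details, per the message above) can be sketched with a plain formatter. The exact wording is illustrative — the final message text lives in the PR, not the KIP — though `LogAndSkipOnInvalidTimestamp` is a real Kafka Streams extractor class:

```java
public class SkipLogSketch {
    // Hypothetical message format: the discussion only requires that the log
    // carry topic/partition/offset and the reason for the skip.
    static String skippedRecordMessage(String topic, int partition, long offset,
                                       long extractedTimestamp, String extractorClass) {
        return String.format(
            "Skipping record due to negative extracted timestamp. "
                + "topic=[%s] partition=[%d] offset=[%d] "
                + "extractedTimestamp=[%d] extractor=[%s]",
            topic, partition, offset, extractedTimestamp, extractorClass);
    }

    public static void main(String[] args) {
        // Note the topic may be an internal repartition topic, as discussed above.
        System.out.println(skippedRecordMessage(
            "app-KSTREAM-AGGREGATE-STATE-STORE-0000000002-repartition",
            3, 1042L, -1L,
            "org.apache.kafka.streams.processor.LogAndSkipOnInvalidTimestamp"));
    }
}
```

With this much context in the log line, the metric alone can stay coarse while the log pinpoints the offending record.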
>>> On Tue, Apr 3, 2018 at 1:40 PM, Guozhang Wang <wa...@gmail.com>
>> wrote:
>>>
>>>> 1. If we can indeed gather all the context information from the log4j
>>>> entries I'd suggest we change to thread-level (I'm not sure if that is
>>>> doable, so if John already has some WIP PR that can help us decide).
>>>>
>>>> 2. We can consider adding the API in TopologyTestDriver for general
>>> testing
>>>> purposes; that being said, I think Matthias has a good point that this
>>>> alone should not be a driving motivation for us to keep this metric as
>>>> task-level if 1) is true.
>>>>
>>>>
>>>>
>>>> Guozhang
>>>>
>>>>
>>>> On Tue, Apr 3, 2018 at 11:36 AM, Matthias J. Sax <
>> matthias@confluent.io>
>>>> wrote:
>>>>
>>>>> Thanks Guozhang, that was my intent.
>>>>>
>>>>> @John: yes, we should not nail down the exact log message. It's just
>> to
>>>>> point out the trade-off. If we can get the required information in
>> the
>>>>> logs, we might not need task level metrics.
>>>>>
>>>>>
>>>>> -Matthias
>>>>>
>>>>> On 4/3/18 11:26 AM, Guozhang Wang wrote:
>>>>>> I think Matthias' comment is that, we can still record the metrics
>> on
>>>> the
>>>>>> thread-level, while having the WARN log entry to include sufficient
>>>>> context
>>>>>> information so that users can still easily narrow down the
>>>> investigation
>>>>>> scope.
>>>>>>
>>>>>>
>>>>>> Guozhang
>>>>>>
>>>>>> On Tue, Apr 3, 2018 at 11:22 AM, John Roesler <jo...@confluent.io>
>>>> wrote:
>>>>>>
>>>>>>> I agree we should add as much information as is reasonable to the
>>> log.
>>>>> For
>>>>>>> example, see this WIP PR I started for this KIP:
>>>>>>>
>>>>>>> https://github.com/apache/kafka/pull/4812/files#diff-
>>>>>>> 88d129f048bc842c7db5b2566a45fce8R80
>>>>>>>
>>>>>>> and
>>>>>>>
>>>>>>> https://github.com/apache/kafka/pull/4812/files#diff-
>>>>>>> 69e6789eb675ec978a1abd24fed96eb1R111
>>>>>>>
>>>>>>> I'm not sure if we should nail down the log messages in the KIP or
>>> in
>>>>> the
>>>>>>> PR discussion. What say you?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> -John
>>>>>>>
>>>>>>> On Tue, Apr 3, 2018 at 12:20 AM, Matthias J. Sax <
>>>> matthias@confluent.io
>>>>>>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Thanks for sharing your thoughts. As I mentioned originally, I am
>>> not
>>>>>>>> sure about the right log level either. Your arguments are
>>> convincing
>>>> --
>>>>>>>> thus, I am fine with keeping WARN level.
>>>>>>>>
>>>>>>>> The task vs thread level argument is an interesting one.
>> However, I
>>>> am
>>>>>>>> wondering if we should add this information into the
>> corresponding
>>>> WARN
>>>>>>>> logs that we write anyway? For this case, we can also log the
>>>>>>>> corresponding operator (and other information like topic name etc
>>> if
>>>>>>>> needed). WDYT about this?
>>>>>>>>
>>>>>>>>
>>>>>>>> -Matthias
>>>>>>>>
>>>>>>>> On 4/2/18 8:31 PM, Guozhang Wang wrote:
>>>>>>>>> Regarding logging: I'm inclined to keep logging at WARN level
>>> since
>>>>>>>> skipped
>>>>>>>>> records are not expected in normal execution (for all reasons
>> that
>>>> we
>>>>>>> are
>>>>>>>>> aware of), and hence when error happens users should be alerted
>>> from
>>>>>>>>> metrics and looked into the log files, so to me if it is really
>>>>>>> spamming
>>>>>>>>> the log files it is also a good alert for users. Besides for
>>>>>>> deserialize
>>>>>>>>> errors we already log at WARN level for this reason.
>>>>>>>>>
>>>>>>>>> Regarding the metrics levels: I was pondering on that as well. What
>>>>>>>>> made me think about and agree on task-level rather than thread-level
>>>>>>>>> is that skips for some reasons, like window retention, may happen on
>>>>>>>>> only a subset of input partitions; since tasks are correlated with
>>>>>>>>> partitions, task-level metrics can help users narrow down the
>>>>>>>>> specific input data partitions.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Guozhang
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Mon, Apr 2, 2018 at 6:43 PM, John Roesler <john@confluent.io
>>>
>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Matthias,
>>>>>>>>>>
>>>>>>>>>> No worries! Thanks for the reply.
>>>>>>>>>>
>>>>>>>>>> 1) There isn't a connection. I tried using the
>> TopologyTestDriver
>>>> to
>>>>>>>> write
>>>>>>>>>> a quick test exercising the current behavior and discovered
>> that
>>>> the
>>>>>>>>>> metrics weren't available. It seemed like they should be, so I
>>>> tacked
>>>>>>>> it on
>>>>>>>>>> to this KIP. If you feel it's inappropriate, I can pull it back
>>>> out.
>>>>>>>>>>
>>>>>>>>>> 2) I was also concerned about that, but I figured it would come
>>> up
>>>> in
>>>>>>>>>> discussion if I just went ahead and proposed it. And here we
>> are!
>>>>>>>>>>
>>>>>>>>>> Here's my thought: maybe there are two classes of skips:
>>>> "controlled"
>>>>>>>> and
>>>>>>>>>> "uncontrolled", where "controlled" means, as an app author, I
>>>>>>>> deliberately
>>>>>>>>>> filter out some events, and "uncontrolled" means that I simply
>>>> don't
>>>>>>>>>> account for some feature of the data, and the framework skips
>>> them
>>>>> (as
>>>>>>>>>> opposed to crashing).
>>>>>>>>>>
>>>>>>>>>> In this breakdowns, the skips I'm adding metrics for are all
>>>>>>>> uncontrolled
>>>>>>>>>> skips (and we hope to measure all the uncontrolled skips). Our
>>>> skips
>>>>>>> are
>>>>>>>>>> well documented, so it wouldn't be terrible to have an
>>> application
>>>> in
>>>>>>>> which
>>>>>>>>>> you know you expect to have tons of uncontrolled skips, but
>> it's
>>>> not
>>>>>>>> great
>>>>>>>>>> either, since you may also have some *unexpected* uncontrolled
>>>> skips.
>>>>>>>> It'll
>>>>>>>>>> be difficult to notice, since you're probably not alerting on
>> the
>>>>>>> metric
>>>>>>>>>> and filtering out the logs (whatever their level).
>>>>>>>>>>
>>>>>>>>>> I'd recommend any app author, as an alternative, to convert all
>>>>>>> expected
>>>>>>>>>> skips to controlled ones, by updating the topology to filter
>>> those
>>>>>>>> records
>>>>>>>>>> out.
>>>>>>>>>>
>>>>>>>>>> Following from my recommendation, as a library author, I'm
>>> inclined
>>>>> to
>>>>>>>> mark
>>>>>>>>>> those logs WARN, since in my opinion, they should be concerning
>>> to
>>>>> the
>>>>>>>> app
>>>>>>>>>> authors. I'd definitely want to show, rather than hide, them by
>>>>>>>> default, so
>>>>>>>>>> I would pick INFO at least.
>>>>>>>>>>
>>>>>>>>>> That said, logging is always a tricky issue for lower-level
>>>> libraries
>>>>>>>> that
>>>>>>>>>> run inside user code, since we don't have all the information
>> we
>>>> need
>>>>>>> to
>>>>>>>>>> make the right call.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On your last note, yeah, I got that impression from Guozhang as
>>>> well.
>>>>>>>>>> Thanks for the clarification.
>>>>>>>>>>
>>>>>>>>>> -John
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Mon, Apr 2, 2018 at 4:03 PM, Matthias J. Sax <
>>>>>>> matthias@confluent.io>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> John,
>>>>>>>>>>>
>>>>>>>>>>> sorry for my late reply and thanks for updating the KIP.
>>>>>>>>>>>
>>>>>>>>>>> I like your approach about "metrics are for monitoring, logs
>> are
>>>> for
>>>>>>>>>>> debugging" -- however:
>>>>>>>>>>>
>>>>>>>>>>> 1) I don't see a connection between this and the task-level
>>>> metrics
>>>>>>>> that
>>>>>>>>>>> you propose to get the metrics in `TopologyTestDriver`. I
>> don't
>>>>> think
>>>>>>>>>>> people would monitor the `TopologyTestDriver`, and thus I am
>>>>>>>>>>> wondering why it is important to include the metrics there?
>>>>>>>>>>> Thread-level metrics might be easier to monitor, though (i.e.,
>>>>>>>>>>> fewer distinct metrics to monitor).
>>>>>>>>>>>
>>>>>>>>>>> 2) I am a little worried about WARN level logging and that it
>>>> might
>>>>>>> be
>>>>>>>>>>> too chatty -- as you pointed out, it's about debugging, thus
>>> DEBUG
>>>>>>>> level
>>>>>>>>>>> might be better. Not 100% sure about this to be honest. What
>> is
>>>> the
>>>>>>>>>>> general assumption about the frequency for skipped records? I
>>>> could
>>>>>>>>>>> imagine cases for which skipped records are quite frequent and
>>>> thus,
>>>>>>>>>>> WARN level logs might "flood" the logs
>>>>>>>>>>>
>>>>>>>>>>> One final remark:
>>>>>>>>>>>
>>>>>>>>>>>> More
>>>>>>>>>>>> generally, I would like to establish a pattern in which we
>>> could
>>>>> add
>>>>>>>>>> new
>>>>>>>>>>>> values for the "reason" tags without needing a KIP to do so.
>>>>>>>>>>>
>>>>>>>>>>> From my understanding, this is not feasible. Changing metrics
>> is
>>>>>>> always
>>>>>>>>>>> considered a public API change, and we need a KIP for any
>>> change.
>>>> As
>>>>>>> we
>>>>>>>>>>> moved away from tagging, it doesn't matter for the KIP anymore
>>> --
>>>>>>> just
>>>>>>>>>>> wanted to point it out.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> -Matthias
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On 3/30/18 2:47 PM, John Roesler wrote:
>>>>>>>>>>>> Allrighty! The KIP is updated.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks again, all, for the feedback.
>>>>>>>>>>>> -John
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Mar 30, 2018 at 3:35 PM, John Roesler <
>>> john@confluent.io
>>>>>
>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hey Guozhang and Bill,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Ok, I'll update the KIP. At the risk of disturbing
>> consensus,
>>>> I'd
>>>>>>>> like
>>>>>>>>>>> to
>>>>>>>>>>>>> put it in the task instead of the thread so that it'll show
>> up
>>>> in
>>>>>>> the
>>>>>>>>>>>>> TopologyTestDriver metrics as well.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I'm leaning toward keeping the scope where it is right now,
>>> but
>>>> if
>>>>>>>>>>> others
>>>>>>>>>>>>> want to advocate for tossing in some more metrics, we can go
>>>> that
>>>>>>>>>> route.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks all,
>>>>>>>>>>>>> -John
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, Mar 30, 2018 at 2:37 PM, Bill Bejeck <
>>> bbejeck@gmail.com
>>>>>
>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks for the KIP John, and sorry for the late comments.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I'm on the fence with providing a single level metrics,
>> but I
>>>>>>> think
>>>>>>>>>>> we'll
>>>>>>>>>>>>>> have that discussion outside of this KIP.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> * maintain one skipped-record metric (could be per-thread,
>>>>>>>> per-task,
>>>>>>>>>>> or
>>>>>>>>>>>>>>> per-processor-node) with no "reason"
>>>>>>>>>>>>>>> * introduce a warn-level log detailing the
>>>>> topic/partition/offset
>>>>>>>>>> and
>>>>>>>>>>>>>>> reason of the skipped record
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I'm +1 on both of these suggestions.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Finally, we have had requests in the past for some metrics
>>>> around
>>>>>>>>>> when
>>>>>>>>>>>>>> persistent store removes an expired window.  Would adding
>>> that
>>>> to
>>>>>>>> our
>>>>>>>>>>>>>> metrics stretch the scope of this KIP too much?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks again and overall I'm +1 on this KIP
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Bill
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Fri, Mar 30, 2018 at 2:00 PM, Guozhang Wang <
>>>>>>> wangguoz@gmail.com>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The proposal sounds good to me. About "maintain only one
>>> level
>>>>> of
>>>>>>>>>>>>>> metrics"
>>>>>>>>>>>>>>> maybe we can discuss about that separately from this KIP
>>> since
>>>>>>> that
>>>>>>>>>>>>>> would
>>>>>>>>>>>>>>> be a larger scope of discussion. I agree that if we are
>>> going
>>>> to
>>>>>>>>>>>>>> maintain
>>>>>>>>>>>>>>> only one-level metrics it should be lowest level and we
>>> would
>>>>> let
>>>>>>>>>>> users
>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>> do the roll-ups themselves, but I'm still not fully
>>> convinced
>>>>>>> that
>>>>>>>>>> we
>>>>>>>>>>>>>>> should just provide single-level metrics, because 1) I
>> think
>>>> for
>>>>>>>>>>>>>> different
>>>>>>>>>>>>>>> metrics people may be interested to investigate into
>>> different
>>>>>>>>>>>>>>> granularities, e.g. for poll / commit rate these are at
>> the
>>>>>>> lowest
>>>>>>>>>>>>>>> task-level metrics, while for process-rate / skip-rate
>> they
>>>> can
>>>>>>> be
>>>>>>>>>> as
>>>>>>>>>>>>>> low
>>>>>>>>>>>>>>> as processor-node metrics, and 2) user-side rolling ups
>> may
>>>> not
>>>>>>> be
>>>>>>>>>>> very
>>>>>>>>>>>>>>> straight-forward. But for 2) if someone can provide an
>>>> efficient
>>>>>>>> and
>>>>>>>>>>>>>> easy
>>>>>>>>>>>>>>> implementation of that I can be persuaded :)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> For now I'm thinking we can add the metric on
>> thread-level,
>>>>>>> either
>>>>>>>>>>> with
>>>>>>>>>>>>>>> finer grained ones with "reason" tag plus an aggregated
>> one
>>>>>>> without
>>>>>>>>>>> the
>>>>>>>>>>>>>>> tag, or just having a single aggregated metric without the
>>> tag
>>>>>>>> looks
>>>>>>>>>>>>>> good
>>>>>>>>>>>>>>> to me.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Guozhang
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Fri, Mar 30, 2018 at 8:05 AM, John Roesler <
>>>>> john@confluent.io
>>>>>>>>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hey Guozhang,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks for the reply. Regarding JMX, I can dig it. I'll
>>>> provide
>>>>>>> a
>>>>>>>>>>>>>> list in
>>>>>>>>>>>>>>>> the KIP. I was also thinking we'd better start a
>>>> documentation
>>>>>>>> page
>>>>>>>>>>>>>> with
>>>>>>>>>>>>>>>> the metrics listed.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I'd have no problem logging a warning when we skip
>> records.
>>>> On
>>>>>>> the
>>>>>>>>>>>>>> metric
>>>>>>>>>>>>>>>> front, really I'm just pushing for us to maintain only
>> one
>>>>> level
>>>>>>>> of
>>>>>>>>>>>>>>>> metrics. If that's more or less granular (i.e., maybe we
>>>> don't
>>>>>>>>>> have a
>>>>>>>>>>>>>>>> metric per reason and log the reason instead), that's
>> fine
>>> by
>>>>>>> me.
>>>>>>>> I
>>>>>>>>>>>>>> just
>>>>>>>>>>>>>>>> don't think it provides a lot of extra value per
>> complexity
>>>>>>>>>>> (interface
>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>> implementation) to maintain roll-ups at the thread level
>> in
>>>>>>>>>> addition
>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>> lower-level metrics.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> How about this instead:
>>>>>>>>>>>>>>>> * maintain one skipped-record metric (could be
>> per-thread,
>>>>>>>>>> per-task,
>>>>>>>>>>>>>> or
>>>>>>>>>>>>>>>> per-processor-node) with no "reason"
>>>>>>>>>>>>>>>> * introduce a warn-level log detailing the
>>>>>>> topic/partition/offset
>>>>>>>>>> and
>>>>>>>>>>>>>>>> reason of the skipped record
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> If you like that, I can update the KIP.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>> -John
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Thu, Mar 29, 2018 at 6:22 PM, Guozhang Wang <
>>>>>>>> wangguoz@gmail.com
>>>>>>>>>>>
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> One thing you mention is the notion of setting alerts
>> on
>>>>>>> coarser
>>>>>>>>>>>>>>>> metrics
>>>>>>>>>>>>>>>>> being easier than finer ones. All the metric alerting
>>>> systems
>>>>> I
>>>>>>>>>> have
>>>>>>>>>>>>>>> used
>>>>>>>>>>>>>>>>> make it equally easy to alert on metrics by-tag or over
>>>> tags.
>>>>>>> So
>>>>>>>>>> my
>>>>>>>>>>>>>>>>> experience doesn't say that this is a use case. Were you
>>>>>>> thinking
>>>>>>>>>>>>>> of an
>>>>>>>>>>>>>>>>> alerting system that makes such a pre-aggregation
>>> valuable?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> For the commonly used JMX reporter tags will be encoded
>>>>>>> directly
>>>>>>>>>> as
>>>>>>>>>>>>>>> part
>>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>> the object name, and if users wants to monitor them they
>>>> need
>>>>>>> to
>>>>>>>>>>>>>> know
>>>>>>>>>>>>>>>> these
>>>>>>>>>>>>>>>>> values before hand. That is also why I think we do want
>> to
>>>>> list
>>>>>>>>>> all
>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>> possible values of the reason tags in the KIP, since
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> In my email in response to Matthias, I gave an example
>> of
>>>> the
>>>>>>>>>>>>>> kind of
>>>>>>>>>>>>>>>>> scenario that would lead me as an operator to run with
>>> DEBUG
>>>>> on
>>>>>>>>>> all
>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>> time, since I wouldn't be sure, having seen a skipped
>>> record
>>>>>>>> once,
>>>>>>>>>>>>>> that
>>>>>>>>>>>>>>>> it
>>>>>>>>>>>>>>>>> would ever happen again. The solution is to capture all
>>> the
>>>>>>>>>>>>>> available
>>>>>>>>>>>>>>>>> information about the reason and location of skips all
>> the
>>>>>>> time.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> That is a good point. I think we can either expose all
>>>> levels
>>>>>>>>>>>>>> metrics
>>>>>>>>>>>>>>> as
>>>>>>>>>>>>>>>> by
>>>>>>>>>>>>>>>>> default, or only expose the most lower-level metrics and
>>> get
>>>>>>> rid
>>>>>>>>>> of
>>>>>>>>>>>>>>> other
>>>>>>>>>>>>>>>>> levels to let users do roll-ups themselves (which will
>> be
>>> a
>>>>>>> much
>>>>>>>>>>>>>> larger
>>>>>>>>>>>>>>>>> scope for discussion), or we can encourage users to not
>>>> purely
>>>>>>>>>>>>>> depend
>>>>>>>>>>>>>>> on
>>>>>>>>>>>>>>>>> metrics for such trouble shooting: that is to say, users
>>>> only
>>>>>>> be
>>>>>>>>>>>>>>> alerted
>>>>>>>>>>>>>>>>> based on metrics, and we can log a info / warn log4j
>> entry
>>>>> each
>>>>>>>>>>>>>> time we
>>>>>>>>>>>>>>>> are
>>>>>>>>>>>>>>>>> about to skip a record all over the places, so that upon
>>>> being
>>>>>>>>>>>>>> notified
>>>>>>>>>>>>>>>>> users can look into the logs to find the details on
>> where
>>> /
>>>>>>> when
>>>>>>>>>> it
>>>>>>>>>>>>>>>>> happens. WDYT?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Guozhang
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Thu, Mar 29, 2018 at 3:57 PM, John Roesler <
>>>>>>> john@confluent.io
>>>>>>>>>
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hey Guozhang,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks for the review.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> 1.
>>>>>>>>>>>>>>>>>> Matthias raised the same question about the "reason"
>> tag
>>>>>>> values.
>>>>>>>>>> I
>>>>>>>>>>>>>>> can
>>>>>>>>>>>>>>>>> list
>>>>>>>>>>>>>>>>>> all possible values of the "reason" tag, but I'm
>> thinking
>>>>> this
>>>>>>>>>>>>>> level
>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>> detail may not be KIP-worthy, maybe the code and
>>>>> documentation
>>>>>>>>>>>>>> review
>>>>>>>>>>>>>>>>> would
>>>>>>>>>>>>>>>>>> be sufficient. If you all disagree and would like it
>>>> included
>>>>>>> in
>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>> KIP, I
>>>>>>>>>>>>>>>>>> can certainly do that.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> If we do provide roll-up metrics, I agree with the
>>> pattern
>>>> of
>>>>>>>>>>>>>> keeping
>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>> same name but eliminating the tags for the dimensions
>>> that
>>>>>>> were
>>>>>>>>>>>>>>>>> rolled-up.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> 2.
>>>>>>>>>>>>>>>>>> I'm not too sure that implementation efficiency really
>>>>>>> becomes a
>>>>>>>>>>>>>>> factor
>>>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>> choosing whether to (by default) update one coarse
>> metric
>>>> at
>>>>>>> the
>>>>>>>>>>>>>>> thread
>>>>>>>>>>>>>>>>>> level or one granular metric at the processor-node
>> level,
>>>>>>> since
>>>>>>>>>>>>>> it's
>>>>>>>>>>>>>>>> just
>>>>>>>>>>>>>>>>>> one metric being updated either way. I do agree that if
>>> we
>>>>>>> were
>>>>>>>>>> to
>>>>>>>>>>>>>>>> update
>>>>>>>>>>>>>>>>>> the granular metrics and multiple roll-ups, then we
>>> should
>>>>>>>>>>>>>> consider
>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>> efficiency.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I agree it's probably not necessary to surface the
>>> metrics
>>>>> for
>>>>>>>>>> all
>>>>>>>>>>>>>>>> nodes
>>>>>>>>>>>>>>>>>> regardless of whether they can or do skip records.
>>> Perhaps
>>>> we
>>>>>>>> can
>>>>>>>>>>>>>>>> lazily
>>>>>>>>>>>>>>>>>> register the metrics.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> In my email in response to Matthias, I gave an example
>> of
>>>> the
>>>>>>>>>>>>>> kind of
>>>>>>>>>>>>>>>>>> scenario that would lead me as an operator to run with
>>>> DEBUG
>>>>>>> on
>>>>>>>>>>>>>> all
>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>> time, since I wouldn't be sure, having seen a skipped
>>>> record
>>>>>>>>>> once,
>>>>>>>>>>>>>>> that
>>>>>>>>>>>>>>>>> it
>>>>>>>>>>>>>>>>>> would ever happen again. The solution is to capture all
>>> the
>>>>>>>>>>>>>> available
>>>>>>>>>>>>>>>>>> information about the reason and location of skips all
>>> the
>>>>>>> time.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> One thing you mention is the notion of setting alerts
>> on
>>>>>>> coarser
>>>>>>>>>>>>>>>> metrics
>>>>>>>>>>>>>>>>>> being easier than finer ones. All the metric alerting
>>>> systems
>>>>>>> I
>>>>>>>>>>>>>> have
>>>>>>>>>>>>>>>> used
>>>>>>>>>>>>>>>>>> make it equally easy to alert on metrics by-tag or over
>>>> tags.
>>>>>>> So
>>>>>>>>>>>>>> my
>>>>>>>>>>>>>>>>>> experience doesn't say that this is a use case. Were
>> you
>>>>>>>> thinking
>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>> an
>>>>>>>>>>>>>>>>>> alerting system that makes such a pre-aggregation
>>> valuable?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks again,
>>>>>>>>>>>>>>>>>> -John
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Thu, Mar 29, 2018 at 5:24 PM, Guozhang Wang <
>>>>>>>>>>>>>> wangguoz@gmail.com>
>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Hello John,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thanks for the KIP. Some comments:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> 1. Could you list all the possible values of the
>>> "reason"
>>>>>>> tag?
>>>>>>>>>>>>>> In
>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>> JIRA
>>>>>>>>>>>>>>>>>>> ticket I left some potential reasons but I'm not clear
>>> if
>>>>>>>> you're
>>>>>>>>>>>>>>>> going
>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>> categorize each of them as a separate reason, or is
>>> there
>>>>> any
>>>>>>>>>>>>>>>>> additional
>>>>>>>>>>>>>>>>>>> ones you have in mind.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Also I'm wondering if we should add another metric
>> that
>>> do
>>>>>>> not
>>>>>>>>>>>>>> have
>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>> reason tag but aggregates among all possible reasons?
>>> This
>>>>> is
>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>> users
>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>> easily set their alerting notifications (otherwise
>> they
>>>> have
>>>>>>> to
>>>>>>>>>>>>>>> write
>>>>>>>>>>>>>>>>> on
>>>>>>>>>>>>>>>>>>> notification rule per reason) in their monitoring
>>> systems.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> 2. Note that the processor-node metrics is actually
>>>>>>>> "per-thread,
>>>>>>>>>>>>>>>>>> per-task,
>>>>>>>>>>>>>>>>>>> per-processor-node", and today we only set the
>>> per-thread
>>>>>>>>>>>>>> metrics
>>>>>>>>>>>>>>> as
>>>>>>>>>>>>>>>>> INFO
>>>>>>>>>>>>>>>>>>> while leaving the lower two layers as DEBUG. I agree
>>> with
>>>>>>> your
>>>>>>>>>>>>>>>> argument
>>>>>>>>>>>>>>>>>>> that we are missing the per-client roll-up metrics
>>> today,
>>>>> but
>>>>>>>>>>>>>> I'm
>>>>>>>>>>>>>>>>>> convinced
>>>>>>>>>>>>>>>>>>> that the right way to approach it would be
>>>>>>>>>>>>>>>> "just-providing-the-lowest-
>>>>>>>>>>>>>>>>>>> level
>>>>>>>>>>>>>>>>>>> metrics only".
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Note the recoding implementation of these three levels
>>> are
>>>>>>>>>>>>>>> different
>>>>>>>>>>>>>>>>>>> internally today: we did not just do the rolling up to
>>>>>>> generate
>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>> higher-level metrics from the lower level ones, but we
>>>> just
>>>>>>>>>>>>>> record
>>>>>>>>>>>>>>>> them
>>>>>>>>>>>>>>>>>>> separately, which means that, if we turn on multiple
>>>> levels
>>>>>>> of
>>>>>>>>>>>>>>>> metrics,
>>>>>>>>>>>>>>>>>> we
>>>>>>>>>>>>>>>>>>> maybe duplicate collecting some metrics. One can argue
>>>> that
>>>>>>> is
>>>>>>>>>>>>>> not
>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>> best
>>>>>>>>>>>>>>>>>>> way to represent multi-level metrics collecting and
>>>>>>> reporting,
>>>>>>>>>>>>>> but
>>>>>>>>>>>>>>> by
>>>>>>>>>>>>>>>>>> only
>>>>>>>>>>>>>>>>>>> enabling thread-level metrics as INFO today, that
>>>>>>>> implementation
>>>>>>>>>>>>>>>> could
>>>>>>>>>>>>>>>>> be
>>>>>>>>>>>>>>>>>>> more efficient than only collecting the metrics at the
>>>>> lowest
>>>>>>>>>>>>>>> level,
>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>> then do the roll-up calculations outside of the
>> metrics
>>>>>>>> classes.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Plus, today not all processor-nodes may possibly skip
>>>>>>> records,
>>>>>>>>>>>>>>> AFAIK
>>>>>>>>>>>>>>>> we
>>>>>>>>>>>>>>>>>>> will only skip records at the source, sink, window and
>>>>>>>>>>>>>> aggregation
>>>>>>>>>>>>>>>>>>> processor nodes, so adding a metric per processor
>> looks
>>>> like
>>>>>>> an
>>>>>>>>>>>>>>>>> overkill
>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>> me as well. On the other hand, from user's perspective
>>> the
>>>>>>>>>>>>>> "reason"
>>>>>>>>>>>>>>>> tag
>>>>>>>>>>>>>>>>>> may
>>>>>>>>>>>>>>>>>>> be sufficient for them to narrow down where inside the
>>>>>>> topology
>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>> causing
>>>>>>>>>>>>>>>>>>> records to be dropped on the floor. So I think the
>>>>>>> "per-thread,
>>>>>>>>>>>>>>>>> per-task"
>>>>>>>>>>>>>>>>>>> level metrics should be sufficient for them in trouble
>>>> shoot
>>>>>>> in
>>>>>>>>>>>>>>> DEBUG
>>>>>>>>>>>>>>>>>> mode,
>>>>>>>>>>>>>>>>>>> and we can add another "per-thread" level metrics as
>>> INFO
>>>>>>> which
>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>> turned
>>>>>>>>>>>>>>>>>>> on by default. So under normal execution users still
>>> only
>>>>>>> need
>>>>>>>>>>>>>> INFO
>>>>>>>>>>>>>>>>> level
>>>>>>>>>>>>>>>>>>> metrics for alerting (e.g. set alerts on all
>>>> skipped-records
>>>>>>>>>>>>>>> metrics
>>>>>>>>>>>>>>>> as
>>>>>>>>>>>>>>>>>>> non-zero), and then upon trouble shooting they can
>> turn
>>> on
>>>>>>>> DEBUG
>>>>>>>>>>>>>>>>> metrics
>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>> look into which task is actually causing the skipped
>>>>> records.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Guozhang
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Thu, Mar 29, 2018 at 2:03 PM, Matthias J. Sax <
>>>>>>>>>>>>>>>>> matthias@confluent.io>
>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Thanks for the KIP John.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Reading the material on the related Jiras, I am
>>> wondering
>>>>>>> what
>>>>>>>>>>>>>>>>> `reason`
>>>>>>>>>>>>>>>>>>>> tags you want to introduce? Can you elaborate? The
>> KIP
>>>>>>> should
>>>>>>>>>>>>>>> list
>>>>>>>>>>>>>>>>>> those
>>>>>>>>>>>>>>>>>>>> IMHO.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> About the fine grained metrics vs the roll-up: you
>> say
>>>> that
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> the coarse metric aggregates across two dimensions
>>>>>>>>>>>>>>> simultaneously
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Can you elaborate why this is an issue? I am not
>>>> convinced
>>>>>>> atm
>>>>>>>>>>>>>>> that
>>>>>>>>>>>>>>>>> we
>>>>>>>>>>>>>>>>>>>> should put the fine grained metrics into INFO level
>> and
>>>>>>> remove
>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>> roll-up at thread level.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Given that they have to do this sum to get a usable
>>>>>>>>>>>>>> top-level
>>>>>>>>>>>>>>>> view
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> This is a fair concern, but I don't share the
>>> conclusion.
>>>>>>>>>>>>>>> Offering
>>>>>>>>>>>>>>>> a
>>>>>>>>>>>>>>>>>>>> built-in `KafkaStreams` "client" roll-up out of the
>> box
>>>>>>> might
>>>>>>>>>>>>>> be
>>>>>>>>>>>>>>> a
>>>>>>>>>>>>>>>>>>>> better solution. In the past we did not offer this
>> due
>>> to
>>>>>>>>>>>>>>>> performance
>>>>>>>>>>>>>>>>>>>> concerns, but we could allow an "opt-in" mechanism.
>> If
>>>> you
>>>>>>>>>>>>>>>> disagree,
>>>>>>>>>>>>>>>>>> can
>>>>>>>>>>>>>>>>>>>> you provide some reasoning and add them to the
>>> "Rejected
>>>>>>>>>>>>>>>>> alternatives"
>>>>>>>>>>>>>>>>>>>> section.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> To rephrase: I understand the issue about missing
>>>> top-level
>>>>>>>>>>>>>> view,
>>>>>>>>>>>>>>>> but
>>>>>>>>>>>>>>>>>>>> instead of going more fine grained, we should
>> consider
>>> to
>>>>>>> add
>>>>>>>>>>>>>>> this
>>>>>>>>>>>>>>>>>>>> top-level view and add/keep the fine grained metrics
>> at
>>>>>>> DEBUG
>>>>>>>>>>>>>>> level
>>>>>>>>>>>>>>>>>> only
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I am +1 to add TopologyTestDriver#metrics() and to
>>> remove
>>>>>>> old
>>>>>>>>>>>>>>>> metrics
>>>>>>>>>>>>>>>>>>>> directly as you suggested.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> -Matthias
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On 3/28/18 6:42 PM, Ted Yu wrote:
>>>>>>>>>>>>>>>>>>>>> Looks good to me.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Wed, Mar 28, 2018 at 3:11 PM, John Roesler <
>>>>>>>>>>>>>>> john@confluent.io
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Hello all,
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> I am proposing KIP-274 to improve the metrics
>> around
>>>>>>>>>>>>>> skipped
>>>>>>>>>>>>>>>>> records
>>>>>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>>>>>> Streams.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Please find the details here:
>>>>>>>>>>>>>>>>>>>>>> https://cwiki.apache.org/
>>> confluence/display/KAFKA/KIP-
>>>>>>>>>>>>>>>>>>>>>> 274%3A+Kafka+Streams+Skipped+Records+Metrics
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Please let me know what you think!
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>> -John
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>>> -- Guozhang
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>> -- Guozhang
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>> -- Guozhang
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> -- Guozhang
>>>>
>>>
>>
> 
> 
> 


Re: [DISCUSS] KIP-274: Kafka Streams Skipped Records Metrics

Posted by Guozhang Wang <wa...@gmail.com>.
Thanks John, your proposal looks fine to me.

I'll go ahead and look into the PR for more details myself.


Guozhang
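
As a rough sketch of the log format under discussion (the class and method names below are hypothetical, not taken from the Streams codebase), a skipped-record WARN entry carrying the reason plus topic/partition/offset might be assembled like this:

```java
// Hypothetical sketch only: illustrates the context a skipped-record WARN
// entry could carry (reason, topic, partition, offset), as discussed in the
// thread. SkipLogDemo and skipMessage are illustrative names.
public class SkipLogDemo {

    // Builds a log line with the topic/partition/offset and the reason,
    // which is the context an operator needs to pinpoint the offending record.
    static String skipMessage(String reason, String topic, int partition, long offset) {
        return String.format(
            "Skipping record due to %s. topic=[%s] partition=[%d] offset=[%d]",
            reason, topic, partition, offset);
    }

    public static void main(String[] args) {
        // In real code this string would go through the logger, e.g.
        // logger.warn(...), at each point where a record is dropped.
        System.out.println(skipMessage("deserialization error", "input-topic", 2, 100L));
    }
}
```

With a line like this emitted at every drop site, an operator alerted by the single skipped-records metric can grep the logs for the exact offsets and reasons.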

On Tue, Apr 3, 2018 at 1:35 PM, Bill Bejeck <bb...@gmail.com> wrote:

> Hi John,
>
> Thanks for making the updates.
>
> I agree with the information you've included in the logs as described
> above, as log statements without enough context/information can be
> frustrating.
>
> -Bill
>
> On Tue, Apr 3, 2018 at 3:29 PM, John Roesler <jo...@confluent.io> wrote:
>
> > Allrighty, how about this, then...
> >
> > I'll move the metric back to the StreamThread and maintain the existing
> tag
> > (client-id=...(per-thread client-id)). It won't be present in the
> > TopologyTestDriver's metrics.
> >
> > As a side note, I'm not sure that the location of the log messages has
> > visibility into the name of the thread or the task, or the processor
> node,
> > for that matter. But at the end of the day, I don't think it really
> > matters.
> >
> > None of those identifiers are in the public interface or user-controlled.
> > For them to be useful for debugging, users would have to gain a very deep
> > understanding of how their DSL program gets executed. From my
> perspective,
> > they are all included in metric tags only to prevent collisions between
> the
> > same metrics in different (e.g.) threads.
> >
> > I think what's important is to provide the right information in the logs
> > so that users will be able to debug their issues. This is why the logs in my
> > PR include the topic/partition/offset of the offending data, as well as
> the
> > stacktrace of the exception from the deserializer (or for timestamps, the
> > extracted timestamp and the class name of their extractor). This
> > information alone should let them pinpoint the offending data and fix it.
> >
> > (I am aware that that topic name might be a repartition topic, and
> > therefore also esoteric from the user's perspective, but I think it's the
> > best we can do right now. It might be nice to explicitly take on a
> > debugging ergonomics task in the future and give all processor nodes
> > human-friendly names. Then, we could surface these names in any logs or
> > exceptions. But I'm inclined to call this out-of-scope for now.)
> >
> > Thanks again,
> > -John
> >
> > On Tue, Apr 3, 2018 at 1:40 PM, Guozhang Wang <wa...@gmail.com>
> wrote:
> >
> > > 1. If we can indeed gather all the context information from the log4j
> > > entries I'd suggest we change to thread-level (I'm not sure if that is
> > > doable, so if John has already some WIP PR that can help us decide).
> > >
> > > 2. We can consider adding the API in TopologyTestDriver for general
> > testing
> > > purposes; that being said, I think Matthias has a good point that this
> > > alone should not be a driving motivation for us to keep this metric as
> > > task-level if 1) is true.
> > >
> > >
> > >
> > > Guozhang
> > >
> > >
> > > On Tue, Apr 3, 2018 at 11:36 AM, Matthias J. Sax <
> matthias@confluent.io>
> > > wrote:
> > >
> > > > Thanks Guozhang, that was my intent.
> > > >
> > > > @John: yes, we should not nail down the exact log message. It's just
> to
> > > > point out the trade-off. If we can get the required information in
> the
> > > > logs, we might not need task level metrics.
> > > >
> > > >
> > > > -Matthias
> > > >
> > > > On 4/3/18 11:26 AM, Guozhang Wang wrote:
> > > > > I think Matthias' comment is that we can still record the metrics
> on
> > > the
> > > > > thread-level, while having the WARN log entry include sufficient
> > > > context
> > > > > information so that users can still easily narrow down the
> > > investigation
> > > > > scope.
> > > > >
> > > > >
> > > > > Guozhang
> > > > >
> > > > > On Tue, Apr 3, 2018 at 11:22 AM, John Roesler <jo...@confluent.io>
> > > wrote:
> > > > >
> > > > >> I agree we should add as much information as is reasonable to the
> > log.
> > > > For
> > > > >> example, see this WIP PR I started for this KIP:
> > > > >>
> > > > >> https://github.com/apache/kafka/pull/4812/files#diff-
> > > > >> 88d129f048bc842c7db5b2566a45fce8R80
> > > > >>
> > > > >> and
> > > > >>
> > > > >> https://github.com/apache/kafka/pull/4812/files#diff-
> > > > >> 69e6789eb675ec978a1abd24fed96eb1R111
> > > > >>
> > > > >> I'm not sure if we should nail down the log messages in the KIP or
> > in
> > > > the
> > > > >> PR discussion. What say you?
> > > > >>
> > > > >> Thanks,
> > > > >> -John
> > > > >>
> > > > >> On Tue, Apr 3, 2018 at 12:20 AM, Matthias J. Sax <
> > > matthias@confluent.io
> > > > >
> > > > >> wrote:
> > > > >>
> > > > >>> Thanks for sharing your thoughts. As I mentioned originally, I am
> > not
> > > > >>> sure about the right log level either. Your arguments are
> > convincing
> > > --
> > > > >>> thus, I am fine with keeping WARN level.
> > > > >>>
> > > > >>> The task vs thread level argument is an interesting one.
> However, I
> > > am
> > > > >>> wondering if we should add this information into the
> corresponding
> > > WARN
> > > > >>> logs that we write anyway? For this case, we can also log the
> > > > >>> corresponding operator (and other information like topic name etc
> > if
> > > > >>> needed). WDYT about this?
> > > > >>>
> > > > >>>
> > > > >>> -Matthias
> > > > >>>
> > > > >>> On 4/2/18 8:31 PM, Guozhang Wang wrote:
> > > > >>>> Regarding logging: I'm inclined to keep logging at WARN level
> > since
> > > > >>> skipped
> > > > >>>> records are not expected in normal execution (for all reasons
> that
> > > we
> > > > >> are
> > > > >>>> aware of), and hence when an error happens users should be alerted
> > from
> > > > >>>> metrics and look into the log files, so to me if it is really
> > > > >> spamming
> > > > >>>> the log files it is also a good alert for users. Besides, for
> > > > >> deserialization
> > > > >>>> errors we already log at WARN level for this reason.
> > > > >>>>
> > > > >>>> Regarding the metrics-levels: I was pondering on that as well.
> > What
> > > > >> made
> > > > >>> me
> > > > >>>> think and agree on task-level rather than thread-level is that for
> > some
> > > > >>> reasons
> > > > >>>> like window retention, they may possibly be happening on a
> subset
> > of
> > > > >>> input
> > > > >>>> partitions, and since tasks are correlated with partitions, the
> > task-level
> > > > >>> metrics
> > > > >>>> can help users narrow down the specific input data
> > partitions.
> > > > >>>>
> > > > >>>>
> > > > >>>> Guozhang
> > > > >>>>
> > > > >>>>
> > > > >>>> On Mon, Apr 2, 2018 at 6:43 PM, John Roesler <john@confluent.io
> >
> > > > >> wrote:
> > > > >>>>
> > > > >>>>> Hi Matthias,
> > > > >>>>>
> > > > >>>>> No worries! Thanks for the reply.
> > > > >>>>>
> > > > >>>>> 1) There isn't a connection. I tried using the
> TopologyTestDriver
> > > to
> > > > >>> write
> > > > >>>>> a quick test exercising the current behavior and discovered
> that
> > > the
> > > > >>>>> metrics weren't available. It seemed like they should be, so I
> > > tacked
> > > > >>> it on
> > > > >>>>> to this KIP. If you feel it's inappropriate, I can pull it back
> > > out.
> > > > >>>>>
> > > > >>>>> 2) I was also concerned about that, but I figured it would come
> > up
> > > in
> > > > >>>>> discussion if I just went ahead and proposed it. And here we
> are!
> > > > >>>>>
> > > > >>>>> Here's my thought: maybe there are two classes of skips:
> > > "controlled"
> > > > >>> and
> > > > >>>>> "uncontrolled", where "controlled" means, as an app author, I
> > > > >>> deliberately
> > > > >>>>> filter out some events, and "uncontrolled" means that I simply
> > > don't
> > > > >>>>> account for some feature of the data, and the framework skips
> > them
> > > > (as
> > > > >>>>> opposed to crashing).
> > > > >>>>>
> > > > >>>>> In this breakdown, the skips I'm adding metrics for are all
> > > > >>> uncontrolled
> > > > >>>>> skips (and we hope to measure all the uncontrolled skips). Our
> > > skips
> > > > >> are
> > > > >>>>> well documented, so it wouldn't be terrible to have an
> > application
> > > in
> > > > >>> which
> > > > >>>>> you know you expect to have tons of uncontrolled skips, but
> it's
> > > not
> > > > >>> great
> > > > >>>>> either, since you may also have some *unexpected* uncontrolled
> > > skips.
> > > > >>> It'll
> > > > >>>>> be difficult to notice, since you're probably not alerting on
> the
> > > > >> metric
> > > > >>>>> and filtering out the logs (whatever their level).
> > > > >>>>>
> > > > >>>>> I'd recommend any app author, as an alternative, to convert all
> > > > >> expected
> > > > >>>>> skips to controlled ones, by updating the topology to filter
> > those
> > > > >>> records
> > > > >>>>> out.
> > > > >>>>>
> > > > >>>>> Following from my recommendation, as a library author, I'm
> > inclined
> > > > to
> > > > >>> mark
> > > > >>>>> those logs WARN, since in my opinion, they should be concerning
> > to
> > > > the
> > > > >>> app
> > > > >>>>> authors. I'd definitely want to show, rather than hide, them by
> > > > >>> default, so
> > > > >>>>> I would pick INFO at least.
> > > > >>>>>
> > > > >>>>> That said, logging is always a tricky issue for lower-level
> > > libraries
> > > > >>> that
> > > > >>>>> run inside user code, since we don't have all the information
> we
> > > need
> > > > >> to
> > > > >>>>> make the right call.
> > > > >>>>>
> > > > >>>>>
> > > > >>>>>
> > > > >>>>> On your last note, yeah, I got that impression from Guozhang as
> > > well.
> > > > >>>>> Thanks for the clarification.
> > > > >>>>>
> > > > >>>>> -John
> > > > >>>>>
> > > > >>>>>
> > > > >>>>>
> > > > >>>>> On Mon, Apr 2, 2018 at 4:03 PM, Matthias J. Sax <
> > > > >> matthias@confluent.io>
> > > > >>>>> wrote:
> > > > >>>>>
> > > > >>>>>> John,
> > > > >>>>>>
> > > > >>>>>> sorry for my late reply and thanks for updating the KIP.
> > > > >>>>>>
> > > > >>>>>> I like your approach about "metrics are for monitoring, logs
> are
> > > for
> > > > >>>>>> debugging" -- however:
> > > > >>>>>>
> > > > >>>>>> 1) I don't see a connection between this and the task-level
> > > metrics
> > > > >>> that
> > > > >>>>>> you propose to get the metrics in `TopologyTestDriver`. I
> don't
> > > > think
> > > > >>>>>> people would monitor the `TopologyTestDriver` and thus
> wondering
> > > why
> > > > >> it
> > > > >>>>>> is important to include the metrics there? Thread-level metric
> > > might
> > > > >> be
> > > > >>>>>> easier to monitor though (i.e., fewer different metrics to
> monitor).
> > > > >>>>>>
> > > > >>>>>> 2) I am a little worried about WARN level logging and that it
> > > might
> > > > >> be
> > > > >>>>>> too chatty -- as you pointed out, it's about debugging, thus
> > DEBUG
> > > > >>> level
> > > > >>>>>> might be better. Not 100% sure about this to be honest. What
> is
> > > the
> > > > >>>>>> general assumption about the frequency for skipped records? I
> > > could
> > > > >>>>>> imagine cases for which skipped records are quite frequent and
> > > thus,
> > > > >>>>>> WARN level logs might "flood" the logs
> > > > >>>>>>
> > > > >>>>>> One final remark:
> > > > >>>>>>
> > > > >>>>>>> More
> > > > >>>>>>> generally, I would like to establish a pattern in which we
> > could
> > > > add
> > > > >>>>> new
> > > > >>>>>>> values for the "reason" tags without needing a KIP to do so.
> > > > >>>>>>
> > > > >>>>>> From my understanding, this is not feasible. Changing metrics
> is
> > > > >> always
> > > > >>>>>> considered a public API change, and we need a KIP for any
> > change.
> > > As
> > > > >> we
> > > > >>>>>> moved away from tagging, it doesn't matter for the KIP anymore
> > --
> > > > >> just
> > > > >>>>>> wanted to point it out.
> > > > >>>>>>
> > > > >>>>>>
> > > > >>>>>> -Matthias
> > > > >>>>>>
> > > > >>>>>>
> > > > >>>>>> On 3/30/18 2:47 PM, John Roesler wrote:
> > > > >>>>>>> Allrighty! The KIP is updated.
> > > > >>>>>>>
> > > > >>>>>>> Thanks again, all, for the feedback.
> > > > >>>>>>> -John
> > > > >>>>>>>
> > > > >>>>>>> On Fri, Mar 30, 2018 at 3:35 PM, John Roesler <
> > john@confluent.io
> > > >
> > > > >>>>> wrote:
> > > > >>>>>>>
> > > > >>>>>>>> Hey Guozhang and Bill,
> > > > >>>>>>>>
> > > > >>>>>>>> Ok, I'll update the KIP. At the risk of disturbing
> consensus,
> > > I'd
> > > > >>> like
> > > > >>>>>> to
> > > > >>>>>>>> put it in the task instead of the thread so that it'll show
> up
> > > in
> > > > >> the
> > > > >>>>>>>> TopologyTestDriver metrics as well.
> > > > >>>>>>>>
> > > > >>>>>>>> I'm leaning toward keeping the scope where it is right now,
> > but
> > > if
> > > > >>>>>> others
> > > > >>>>>>>> want to advocate for tossing in some more metrics, we can go
> > > that
> > > > >>>>> route.
> > > > >>>>>>>>
> > > > >>>>>>>> Thanks all,
> > > > >>>>>>>> -John
> > > > >>>>>>>>
> > > > >>>>>>>> On Fri, Mar 30, 2018 at 2:37 PM, Bill Bejeck <
> > bbejeck@gmail.com
> > > >
> > > > >>>>> wrote:
> > > > >>>>>>>>
> > > > >>>>>>>>> Thanks for the KIP John, and sorry for the late comments.
> > > > >>>>>>>>>
> > > > >>>>>>>>> I'm on the fence with providing a single level metrics,
> but I
> > > > >> think
> > > > >>>>>> we'll
> > > > >>>>>>>>> have that discussion outside of this KIP.
> > > > >>>>>>>>>
> > > > >>>>>>>>>> * maintain one skipped-record metric (could be per-thread,
> > > > >>> per-task,
> > > > >>>>>> or
> > > > >>>>>>>>>> per-processor-node) with no "reason"
> > > > >>>>>>>>>> * introduce a warn-level log detailing the
> > > > topic/partition/offset
> > > > >>>>> and
> > > > >>>>>>>>>> reason of the skipped record
> > > > >>>>>>>>>
> > > > >>>>>>>>> I'm +1 on both of these suggestions.
> > > > >>>>>>>>>
> > > > >>>>>>>>> Finally, we have had requests in the past for some metrics
> > > around
> > > > >>>>> when
> > > > >>>>>>>>> persistent store removes an expired window.  Would adding
> > that
> > > to
> > > > >>> our
> > > > >>>>>>>>> metrics stretch the scope of this KIP too much?
> > > > >>>>>>>>>
> > > > >>>>>>>>> Thanks again and overall I'm +1 on this KIP
> > > > >>>>>>>>>
> > > > >>>>>>>>> Bill
> > > > >>>>>>>>>
> > > > >>>>>>>>> On Fri, Mar 30, 2018 at 2:00 PM, Guozhang Wang <
> > > > >> wangguoz@gmail.com>
> > > > >>>>>>>>> wrote:
> > > > >>>>>>>>>
> > > > >>>>>>>>>> The proposal sounds good to me. About "maintain only one
> > level
> > > > of
> > > > >>>>>>>>> metrics"
> > > > >>>>>>>>>> maybe we can discuss about that separately from this KIP
> > since
> > > > >> that
> > > > >>>>>>>>> would
> > > > >>>>>>>>>> be a larger scope of discussion. I agree that if we are
> > going
> > > to
> > > > >>>>>>>>> maintain
> > > > >>>>>>>>>> only one-level metrics it should be lowest level and we
> > would
> > > > let
> > > > >>>>>> users
> > > > >>>>>>>>> to
> > > > >>>>>>>>>> do the roll-ups themselves, but I'm still not fully
> > convinced
> > > > >> that
> > > > >>>>> we
> > > > >>>>>>>>>> should just provide single-level metrics, because 1) I
> think
> > > for
> > > > >>>>>>>>> different
> > > > >>>>>>>>>> metrics people may be interested to investigate into
> > different
> > > > >>>>>>>>>> granularities, e.g. for poll / commit rate these are at
> the
> > > > >> lowest
> > > > >>>>>>>>>> task-level metrics, while for process-rate / skip-rate
> they
> > > can
> > > > >> be
> > > > >>>>> as
> > > > >>>>>>>>> low
> > > > >>>>>>>>>> as processor-node metrics, and 2) user-side rolling ups
> may
> > > not
> > > > >> be
> > > > >>>>>> very
> > > > >>>>>>>>>> straight-forward. But for 2) if someone can provide an
> > > efficient
> > > > >>> and
> > > > >>>>>>>>> easy
> > > > >>>>>>>>>> implementation of that I can be persuaded :)
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> For now I'm thinking we can add the metric on
> thread-level,
> > > > >> either
> > > > >>>>>> with
> > > > >>>>>>>>>> finer grained ones with "reason" tag plus an aggregated
> one
> > > > >> without
> > > > >>>>>> the
> > > > >>>>>>>>>> tag, or just having a single aggregated metric without the
> > tag
> > > > >>> looks
> > > > >>>>>>>>> good
> > > > >>>>>>>>>> to me.
> > > > >>>>>>>>>>
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> Guozhang
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> On Fri, Mar 30, 2018 at 8:05 AM, John Roesler <
> > > > john@confluent.io
> > > > >>>
> > > > >>>>>>>>> wrote:
> > > > >>>>>>>>>>
> > > > >>>>>>>>>>> Hey Guozhang,
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>> Thanks for the reply. Regarding JMX, I can dig it. I'll
> > > provide
> > > > >> a
> > > > >>>>>>>>> list in
> > > > >>>>>>>>>>> the KIP. I was also thinking we'd better start a
> > > documentation
> > > > >>> page
> > > > >>>>>>>>> with
> > > > >>>>>>>>>>> the metrics listed.
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>> I'd have no problem logging a warning when we skip
> records.
> > > On
> > > > >> the
> > > > >>>>>>>>> metric
> > > > >>>>>>>>>>> front, really I'm just pushing for us to maintain only
> one
> > > > level
> > > > >>> of
> > > > >>>>>>>>>>> metrics. If that's more or less granular (i.e., maybe we
> > > don't
> > > > >>>>> have a
> > > > >>>>>>>>>>> metric per reason and log the reason instead), that's
> fine
> > by
> > > > >> me.
> > > > >>> I
> > > > >>>>>>>>> just
> > > > >>>>>>>>>>> don't think it provides a lot of extra value per
> complexity
> > > > >>>>>> (interface
> > > > >>>>>>>>>> and
> > > > >>>>>>>>>>> implementation) to maintain roll-ups at the thread level
> in
> > > > >>>>> addition
> > > > >>>>>>>>> to
> > > > >>>>>>>>>>> lower-level metrics.
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>> How about this instead:
> > > > >>>>>>>>>>> * maintain one skipped-record metric (could be
> per-thread,
> > > > >>>>> per-task,
> > > > >>>>>>>>> or
> > > > >>>>>>>>>>> per-processor-node) with no "reason"
> > > > >>>>>>>>>>> * introduce a warn-level log detailing the
> > > > >> topic/partition/offset
> > > > >>>>> and
> > > > >>>>>>>>>>> reason of the skipped record
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>> If you like that, I can update the KIP.
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>> Thanks,
> > > > >>>>>>>>>>> -John
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>> On Thu, Mar 29, 2018 at 6:22 PM, Guozhang Wang <
> > > > >>> wangguoz@gmail.com
> > > > >>>>>>
> > > > >>>>>>>>>> wrote:
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>>>> One thing you mention is the notion of setting alerts
> on
> > > > >> coarser
> > > > >>>>>>>>>>> metrics
> > > > >>>>>>>>>>>> being easier than finer ones. All the metric alerting
> > > systems
> > > > I
> > > > >>>>> have
> > > > >>>>>>>>>> used
> > > > >>>>>>>>>>>> make it equally easy to alert on metrics by-tag or over
> > > tags.
> > > > >> So
> > > > >>>>> my
> > > > >>>>>>>>>>>> experience doesn't say that this is a use case. Were you
> > > > >> thinking
> > > > >>>>>>>>> of an
> > > > >>>>>>>>>>>> alerting system that makes such a pre-aggregation
> > valuable?
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>> For the commonly used JMX reporter tags will be encoded
> > > > >> directly
> > > > >>>>> as
> > > > >>>>>>>>>> part
> > > > >>>>>>>>>>> of
> > > > >>>>>>>>>>>> the object name, and if users wants to monitor them they
> > > need
> > > > >> to
> > > > >>>>>>>>> know
> > > > >>>>>>>>>>> these
> > > > >>>>>>>>>>>> values before hand. That is also why I think we do want
> to
> > > > list
> > > > >>>>> all
> > > > >>>>>>>>> the
> > > > >>>>>>>>>>>> possible values of the reason tags in the KIP, since
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>>> In my email in response to Matthias, I gave an example
> of
> > > the
> > > > >>>>>>>>> kind of
> > > > >>>>>>>>>>>> scenario that would lead me as an operator to run with
> > DEBUG
> > > > on
> > > > >>>>> all
> > > > >>>>>>>>> the
> > > > >>>>>>>>>>>> time, since I wouldn't be sure, having seen a skipped
> > record
> > > > >>> once,
> > > > >>>>>>>>> that
> > > > >>>>>>>>>>> it
> > > > >>>>>>>>>>>> would ever happen again. The solution is to capture all
> > the
> > > > >>>>>>>>> available
> > > > >>>>>>>>>>>> information about the reason and location of skips all
> the
> > > > >> time.
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>> That is a good point. I think we can either expose all
> > > levels
> > > > >>>>>>>>> metrics
> > > > >>>>>>>>>> as
> > > > >>>>>>>>>>> by
> > > > >>>>>>>>>>>> default, or only expose the most lower-level metrics and
> > get
> > > > >> rid
> > > > >>>>> of
> > > > >>>>>>>>>> other
> > > > >>>>>>>>>>>> levels to let users do roll-ups themselves (which will
> be
> > a
> > > > >> much
> > > > >>>>>>>>> larger
> > > > >>>>>>>>>>>> scope for discussion), or we can encourage users to not
> > > purely
> > > > >>>>>>>>> depend
> > > > >>>>>>>>>> on
> > > > >>>>>>>>>>>> metrics for such trouble shooting: that is to say, users
> > > only
> > > > >> be
> > > > >>>>>>>>>> alerted
> > > > >>>>>>>>>>>> based on metrics, and we can log a info / warn log4j
> entry
> > > > each
> > > > >>>>>>>>> time we
> > > > >>>>>>>>>>> are
> > > > >>>>>>>>>>>> about to skip a record all over the places, so that upon
> > > being
> > > > >>>>>>>>> notified
> > > > >>>>>>>>>>>> users can look into the logs to find the details on
> where
> > /
> > > > >> when
> > > > >>>>> it
> > > > >>>>>>>>>>>> happens. WDYT?
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>> Guozhang
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>> On Thu, Mar 29, 2018 at 3:57 PM, John Roesler <
> > > > >> john@confluent.io
> > > > >>>>
> > > > >>>>>>>>>> wrote:
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>>> Hey Guozhang,
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> Thanks for the review.
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> 1.
> > > > >>>>>>>>>>>>> Matthias raised the same question about the "reason"
> tag
> > > > >> values.
> > > > >>>>> I
> > > > >>>>>>>>>> can
> > > > >>>>>>>>>>>> list
> > > > >>>>>>>>>>>>> all possible values of the "reason" tag, but I'm
> thinking
> > > > this
> > > > >>>>>>>>> level
> > > > >>>>>>>>>> of
> > > > >>>>>>>>>>>>> detail may not be KIP-worthy, maybe the code and
> > > > documentation
> > > > >>>>>>>>> review
> > > > >>>>>>>>>>>> would
> > > > >>>>>>>>>>>>> be sufficient. If you all disagree and would like it
> > > included
> > > > >> in
> > > > >>>>>>>>> the
> > > > >>>>>>>>>>>> KIP, I
> > > > >>>>>>>>>>>>> can certainly do that.
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> If we do provide roll-up metrics, I agree with the
> > pattern
> > > of
> > > > >>>>>>>>> keeping
> > > > >>>>>>>>>>> the
> > > > >>>>>>>>>>>>> same name but eliminating the tags for the dimensions
> > that
> > > > >> were
> > > > >>>>>>>>>>>> rolled-up.
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> 2.
> > > > >>>>>>>>>>>>> I'm not too sure that implementation efficiency really
> > > > >> becomes a
> > > > >>>>>>>>>> factor
> > > > >>>>>>>>>>>> in
> > > > >>>>>>>>>>>>> choosing whether to (by default) update one coarse
> metric
> > > at
> > > > >> the
> > > > >>>>>>>>>> thread
> > > > >>>>>>>>>>>>> level or one granular metric at the processor-node
> level,
> > > > >> since
> > > > >>>>>>>>> it's
> > > > >>>>>>>>>>> just
> > > > >>>>>>>>>>>>> one metric being updated either way. I do agree that if
> > we
> > > > >> were
> > > > >>>>> to
> > > > >>>>>>>>>>> update
> > > > >>>>>>>>>>>>> the granular metrics and multiple roll-ups, then we
> > should
> > > > >>>>>>>>> consider
> > > > >>>>>>>>>> the
> > > > >>>>>>>>>>>>> efficiency.
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> I agree it's probably not necessary to surface the
> > metrics
> > > > for
> > > > >>>>> all
> > > > >>>>>>>>>>> nodes
> > > > >>>>>>>>>>>>> regardless of whether they can or do skip records.
> > Perhaps
> > > we
> > > > >>> can
> > > > >>>>>>>>>>> lazily
> > > > >>>>>>>>>>>>> register the metrics.
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> In my email in response to Matthias, I gave an example
> of
> > > the
> > > > >>>>>>>>> kind of
> > > > >>>>>>>>>>>>> scenario that would lead me as an operator to run with
> > > DEBUG
> > > > >> on
> > > > >>>>>>>>> all
> > > > >>>>>>>>>> the
> > > > >>>>>>>>>>>>> time, since I wouldn't be sure, having seen a skipped
> > > record
> > > > >>>>> once,
> > > > >>>>>>>>>> that
> > > > >>>>>>>>>>>> it
> > > > >>>>>>>>>>>>> would ever happen again. The solution is to capture all
> > the
> > > > >>>>>>>>> available
> > > > >>>>>>>>>>>>> information about the reason and location of skips all
> > the
> > > > >> time.
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> One thing you mention is the notion of setting alerts
> on
> > > > >> coarser
> > > > >>>>>>>>>>> metrics
> > > > >>>>>>>>>>>>> being easier than finer ones. All the metric alerting
> > > systems
> > > > >> I
> > > > >>>>>>>>> have
> > > > >>>>>>>>>>> used
> > > > >>>>>>>>>>>>> make it equally easy to alert on metrics by-tag or over
> > > tags.
> > > > >> So
> > > > >>>>>>>>> my
> > > > >>>>>>>>>>>>> experience doesn't say that this is a use case. Were
> you
> > > > >>> thinking
> > > > >>>>>>>>> of
> > > > >>>>>>>>>> an
> > > > >>>>>>>>>>>>> alerting system that makes such a pre-aggregation
> > valuable?
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> Thanks again,
> > > > >>>>>>>>>>>>> -John
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> On Thu, Mar 29, 2018 at 5:24 PM, Guozhang Wang <
> > > > >>>>>>>>> wangguoz@gmail.com>
> > > > >>>>>>>>>>>> wrote:
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>> Hello John,
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>> Thanks for the KIP. Some comments:
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>> 1. Could you list all the possible values of the
> > "reason"
> > > > >> tag?
> > > > >>>>>>>>> In
> > > > >>>>>>>>>> the
> > > > >>>>>>>>>>>>> JIRA
> > > > >>>>>>>>>>>>>> ticket I left some potential reasons but I'm not clear
> > if
> > > > >>> you're
> > > > >>>>>>>>>>> going
> > > > >>>>>>>>>>>> to
> > > > >>>>>>>>>>>>>> categorize each of them as a separate reason, or is
> > there
> > > > any
> > > > >>>>>>>>>>>> additional
> > > > >>>>>>>>>>>>>> ones you have in mind.
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>> Also I'm wondering if we should add another metric
> that
> > do
> > > > >> not
> > > > >>>>>>>>> have
> > > > >>>>>>>>>>> the
> > > > >>>>>>>>>>>>>> reason tag but aggregates among all possible reasons?
> > This
> > > > is
> > > > >>>>>>>>> for
> > > > >>>>>>>>>>> users
> > > > >>>>>>>>>>>>> to
> > > > >>>>>>>>>>>>>> easily set their alerting notifications (otherwise
> they
> > > have
> > > > >> to
> > > > >>>>>>>>>> write
> > > > >>>>>>>>>>>> on
> > > > >>>>>>>>>>>>>> notification rule per reason) in their monitoring
> > systems.
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>> 2. Note that the processor-node metrics is actually
> > > > >>> "per-thread,
> > > > >>>>>>>>>>>>> per-task,
> > > > >>>>>>>>>>>>>> per-processor-node", and today we only set the
> > per-thread
> > > > >>>>>>>>> metrics
> > > > >>>>>>>>>> as
> > > > >>>>>>>>>>>> INFO
> > > > >>>>>>>>>>>>>> while leaving the lower two layers as DEBUG. I agree
> > with
> > > > >> your
> > > > >>>>>>>>>>> argument
> > > > >>>>>>>>>>>>>> that we are missing the per-client roll-up metrics
> > today,
> > > > but
> > > > >>>>>>>>> I'm
> > > > >>>>>>>>>>>>> convinced
> > > > >>>>>>>>>>>>>> that the right way to approach it would be
> > > > >>>>>>>>>>> "just-providing-the-lowest-
> > > > >>>>>>>>>>>>>> level
> > > > >>>>>>>>>>>>>> metrics only".
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>> Note the recording implementations of these three levels
> > are
> > > > >>>>>>>>>> different
> > > > >>>>>>>>>>>>>> internally today: we did not just do the rolling up to
> > > > >> generate
> > > > >>>>>>>>> the
> > > > >>>>>>>>>>>>>> higher-level metrics from the lower level ones, but we
> > > just
> > > > >>>>>>>>> record
> > > > >>>>>>>>>>> them
> > > > >>>>>>>>>>>>>> separately, which means that, if we turn on multiple
> > > levels
> > > > >> of
> > > > >>>>>>>>>>> metrics,
> > > > >>>>>>>>>>>>> we
> > > > >>>>>>>>>>>>>> maybe duplicate collecting some metrics. One can argue
> > > that
> > > > >> is
> > > > >>>>>>>>> not
> > > > >>>>>>>>>>> the
> > > > >>>>>>>>>>>>> best
> > > > >>>>>>>>>>>>>> way to represent multi-level metrics collecting and
> > > > >> reporting,
> > > > >>>>>>>>> but
> > > > >>>>>>>>>> by
> > > > >>>>>>>>>>>>> only
> > > > >>>>>>>>>>>>>> enabling thread-level metrics as INFO today, that
> > > > >>> implementation
> > > > >>>>>>>>>>> could
> > > > >>>>>>>>>>>> be
> > > > >>>>>>>>>>>>>> more efficient than only collecting the metrics at the
> > > > lowest
> > > > >>>>>>>>>> level,
> > > > >>>>>>>>>>>> and
> > > > >>>>>>>>>>>>>> then do the roll-up calculations outside of the
> metrics
> > > > >>> classes.
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>> Plus, today not all processor-nodes may possibly skip
> > > > >> records,
> > > > >>>>>>>>>> AFAIK
> > > > >>>>>>>>>>> we
> > > > >>>>>>>>>>>>>> will only skip records at the source, sink, window and
> > > > >>>>>>>>> aggregation
> > > > >>>>>>>>>>>>>> processor nodes, so adding a metric per processor
> looks
> > > like
> > > > >> an
> > > > >>>>>>>>>>>> overkill
> > > > >>>>>>>>>>>>> to
> > > > >>>>>>>>>>>>>> me as well. On the other hand, from user's perspective
> > the
> > > > >>>>>>>>> "reason"
> > > > >>>>>>>>>>> tag
> > > > >>>>>>>>>>>>> may
> > > > >>>>>>>>>>>>>> be sufficient for them to narrow down where inside the
> > > > >> topology
> > > > >>>>>>>>> is
> > > > >>>>>>>>>>>>> causing
> > > > >>>>>>>>>>>>>> records to be dropped on the floor. So I think the
> > > > >> "per-thread,
> > > > >>>>>>>>>>>> per-task"
> > > > >>>>>>>>>>>>>> level metrics should be sufficient for them in trouble
> > > shoot
> > > > >> in
> > > > >>>>>>>>>> DEBUG
> > > > >>>>>>>>>>>>> mode,
> > > > >>>>>>>>>>>>>> and we can add another "per-thread" level metrics as
> > INFO
> > > > >> which
> > > > >>>>>>>>> is
> > > > >>>>>>>>>>>> turned
> > > > >>>>>>>>>>>>>> on by default. So under normal execution users still
> > only
> > > > >> need
> > > > >>>>>>>>> INFO
> > > > >>>>>>>>>>>> level
> > > > >>>>>>>>>>>>>> metrics for alerting (e.g. set alerts on all
> > > skipped-records
> > > > >>>>>>>>>> metrics
> > > > >>>>>>>>>>> as
> > > > >>>>>>>>>>>>>> non-zero), and then upon trouble shooting they can
> turn
> > on
> > > > >>> DEBUG
> > > > >>>>>>>>>>>> metrics
> > > > >>>>>>>>>>>>> to
> > > > >>>>>>>>>>>>>> look into which task is actually causing the skipped
> > > > records.
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>> Guozhang
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>> On Thu, Mar 29, 2018 at 2:03 PM, Matthias J. Sax <
> > > > >>>>>>>>>>>> matthias@confluent.io>
> > > > >>>>>>>>>>>>>> wrote:
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>> Thanks for the KIP John.
> > > > >>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>> Reading the material on the related Jiras, I am
> > wondering
> > > > >> what
> > > > >>>>>>>>>>>> `reason`
> > > > >>>>>>>>>>>>>>> tags you want to introduce? Can you elaborate? The
> KIP
> > > > >> should
> > > > >>>>>>>>>> list
> > > > >>>>>>>>>>>>> those
> > > > >>>>>>>>>>>>>>> IMHO.
> > > > >>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>> About the fine grained metrics vs the roll-up: you
> say
> > > that
> > > > >>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>> the coarse metric aggregates across two dimensions
> > > > >>>>>>>>>> simultaneously
> > > > >>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>> Can you elaborate why this is an issue? I am not
> > > convinced
> > > > >> atm
> > > > >>>>>>>>>> that
> > > > >>>>>>>>>>>> we
> > > > >>>>>>>>>>>>>>> should put the fine grained metrics into INFO level
> and
> > > > >> remove
> > > > >>>>>>>>>> the
> > > > >>>>>>>>>>>>>>> roll-up at thread level.
> > > > >>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>> Given that they have to do this sum to get a usable
> > > > >>>>>>>>> top-level
> > > > >>>>>>>>>>> view
> > > > >>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>> This is a fair concern, but I don't share the
> > conclusion.
> > > > >>>>>>>>>> Offering
> > > > >>>>>>>>>>> a
> > > > >>>>>>>>>>>>>>> built-in `KafkaStreams` "client" roll-up out of the
> box
> > > > >> might
> > > > >>>>>>>>> be
> > > > >>>>>>>>>> a
> > > > >>>>>>>>>>>>>>> better solution. In the past we did not offer this
> due
> > to
> > > > >>>>>>>>>>> performance
> > > > >>>>>>>>>>>>>>> concerns, but we could allow an "opt-in" mechanism.
> If
> > > you
> > > > >>>>>>>>>>> disagree,
> > > > >>>>>>>>>>>>> can
> > > > >>>>>>>>>>>>>>> you provide some reasoning and add them to the
> > "Rejected
> > > > >>>>>>>>>>>> alternatives"
> > > > >>>>>>>>>>>>>>> section.
> > > > >>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>> To rephrase: I understand the issue about missing
> > > top-level
> > > > >>>>>>>>> view,
> > > > >>>>>>>>>>> but
> > > > >>>>>>>>>>>>>>> instead of going more fine grained, we should
> consider
> > to
> > > > >> add
> > > > >>>>>>>>>> this
> > > > >>>>>>>>>>>>>>> top-level view and add/keep the fine grained metrics
> at
> > > > >> DEBUG
> > > > >>>>>>>>>> level
> > > > >>>>>>>>>>>>> only
> > > > >>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>> I am +1 to add TopologyTestDriver#metrics() and to
> > remove
> > > > >> old
> > > > >>>>>>>>>>> metrics
> > > > >>>>>>>>>>>>>>> directly as you suggested.
> > > > >>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>> -Matthias
> > > > >>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>> On 3/28/18 6:42 PM, Ted Yu wrote:
> > > > >>>>>>>>>>>>>>>> Looks good to me.
> > > > >>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>> On Wed, Mar 28, 2018 at 3:11 PM, John Roesler <
> > > > >>>>>>>>>> john@confluent.io
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>>>> wrote:
> > > > >>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>> Hello all,
> > > > >>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>> I am proposing KIP-274 to improve the metrics
> around
> > > > >>>>>>>>> skipped
> > > > >>>>>>>>>>>> records
> > > > >>>>>>>>>>>>>> in
> > > > >>>>>>>>>>>>>>>>> Streams.
> > > > >>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>> Please find the details here:
> > > > >>>>>>>>>>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-274%3A+Kafka+Streams+Skipped+Records+Metrics
> > > > >>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>> Please let me know what you think!
> > > > >>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>> Thanks,
> > > > >>>>>>>>>>>>>>>>> -John
> > > > >>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>> --
> > > > >>>>>>>>>>>>>> -- Guozhang
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>> --
> > > > >>>>>>>>>>>> -- Guozhang
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>
> > > > >>>>>>>>>>
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> --
> > > > >>>>>>>>>> -- Guozhang
> > > > >>>>>>>>>>
> > > > >>>>>>>>>
> > > > >>>>>>>>
> > > > >>>>>>>>
> > > > >>>>>>>
> > > > >>>>>>
> > > > >>>>>>
> > > > >>>>>
> > > > >>>>
> > > > >>>>
> > > > >>>>
> > > > >>>
> > > > >>>
> > > > >>
> > > > >
> > > > >
> > > > >
> > > >
> > > >
> > >
> > >
> > > --
> > > -- Guozhang
> > >
> >
>



-- 
-- Guozhang
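
Aside: the quoted thread above repeatedly comes back to letting users do metric roll-ups themselves rather than having the library maintain several aggregation levels. A minimal sketch of such a user-side roll-up follows; the metric name and the flattened tag representation are illustrative assumptions, not actual Kafka Streams API usage.

```java
import java.util.HashMap;
import java.util.Map;

public class SkipMetricRollup {

    // Sums every value recorded under one metric name, across all tag
    // combinations (e.g. across the per-thread "client-id" tag values).
    static double rollUp(Map<String, Map<String, Double>> metricsByName, String name) {
        return metricsByName.getOrDefault(name, Map.of())
                .values().stream()
                .mapToDouble(Double::doubleValue)
                .sum();
    }

    public static void main(String[] args) {
        // Hypothetical per-thread readings of a "skipped-records-total" metric.
        Map<String, Map<String, Double>> metrics = new HashMap<>();
        metrics.put("skipped-records-total", Map.of(
                "client-id=app-thread-1", 3.0,
                "client-id=app-thread-2", 2.0));

        double clientTotal = rollUp(metrics, "skipped-records-total");
        System.out.println("skipped-records-total across threads = " + clientTotal);
    }
}
```

Summing across the per-thread tag this way yields the client-level view discussed above without the library pre-aggregating anything.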

Re: [DISCUSS] KIP-274: Kafka Streams Skipped Records Metrics

Posted by Bill Bejeck <bb...@gmail.com>.
Hi John,

Thanks for making the updates.

I agree with including that information in the logs, as described
above; log statements without enough context/information can be
frustrating.

-Bill
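
For concreteness, here is a sketch of the kind of context-rich skip message being agreed on in this exchange. The helper name and message format are assumptions for illustration only, not the actual Kafka Streams log text.

```java
public class SkippedRecordLog {

    // Formats the context discussed above: the reason for the skip plus the
    // topic/partition/offset of the offending record. Illustrative only.
    static String skipMessage(String reason, String topic, int partition, long offset) {
        return String.format(
                "Skipping record due to %s. topic=[%s] partition=[%d] offset=[%d]",
                reason, topic, partition, offset);
    }

    public static void main(String[] args) {
        // In real code this string would go to a WARN-level logger, possibly
        // with the deserialization exception's stack trace attached.
        System.out.println(skipMessage("deserialization error", "orders", 2, 1847L));
    }
}
```

Because the topic, partition, offset, and reason travel in the message itself, a user can pinpoint the offending record even when internal thread or task names are opaque.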

On Tue, Apr 3, 2018 at 3:29 PM, John Roesler <jo...@confluent.io> wrote:

> Allrighty, how about this, then...
>
> I'll move the metric back to the StreamThread and maintain the existing tag
> (client-id=...(per-thread client-id)). It won't be present in the
> TopologyTestDriver's metrics.
>
> As a side note, I'm not sure that the location of the log messages has
> visibility into the name of the thread or the task, or the processor node,
> for that matter. But at the end of the day, I don't think it really
> matters.
>
> None of those identifiers are in the public interface or user-controlled.
> For them to be useful for debugging, users would have to gain a very deep
> understanding of how their DSL program gets executed. From my perspective,
> they are all included in metric tags only to prevent collisions between the
> same metrics in different (e.g.) threads.
>
> I think what's important is to provide the right information in the logs
> that users will be able to debug their issues. This is why the logs in my
> pr include the topic/partition/offset of the offending data, as well as the
> stacktrace of the exception from the deserializer (or for timestamps, the
> extracted timestamp and the class name of their extractor). This
> information alone should let them pinpoint the offending data and fix it.
>
> (I am aware that that topic name might be a repartition topic, and
> therefore also esoteric from the user's perspective, but I think it's the
> best we can do right now. It might be nice to explicitly take on a
> debugging ergonomics task in the future and give all processor nodes
> human-friendly names. Then, we could surface these names in any logs or
> exceptions. But I'm inclined to call this out-of-scope for now.)
>
> Thanks again,
> -John
>
> On Tue, Apr 3, 2018 at 1:40 PM, Guozhang Wang <wa...@gmail.com> wrote:
>
> > 1. If we can indeed gather all the context information from the log4j
> > entries I'd suggest we change to thread-level (I'm not sure if that is
> > doable, so if John already has some WIP PR it can help us decide).
> >
> > 2. We can consider adding the API in TopologyTestDriver for general
> testing
> > purposes; that being said, I think Matthias has a good point that this
> > alone should not be a driving motivation for us to keep this metric as
> > task-level if 1) is true.
> >
> >
> >
> > Guozhang
> >
> >
> > On Tue, Apr 3, 2018 at 11:36 AM, Matthias J. Sax <ma...@confluent.io>
> > wrote:
> >
> > > Thanks Guozhang, that was my intent.
> > >
> > > @John: yes, we should not nail down the exact log message. It's just to
> > > point out the trade-off. If we can get the required information in the
> > > logs, we might not need task level metrics.
> > >
> > >
> > > -Matthias
> > >
> > > On 4/3/18 11:26 AM, Guozhang Wang wrote:
> > > > I think Matthias' comment is that, we can still record the metrics on
> > the
> > > > thread-level, while having the WARN log entry to include sufficient
> > > context
> > > > information so that users can still easily narrow down the
> > investigation
> > > > scope.
> > > >
> > > >
> > > > Guozhang
> > > >
> > > > On Tue, Apr 3, 2018 at 11:22 AM, John Roesler <jo...@confluent.io>
> > wrote:
> > > >
> > > >> I agree we should add as much information as is reasonable to the
> log.
> > > For
> > > >> example, see this WIP PR I started for this KIP:
> > > >>
> > > >> https://github.com/apache/kafka/pull/4812/files#diff-88d129f048bc842c7db5b2566a45fce8R80
> > > >>
> > > >> and
> > > >>
> > > >> https://github.com/apache/kafka/pull/4812/files#diff-69e6789eb675ec978a1abd24fed96eb1R111
> > > >>
> > > >> I'm not sure if we should nail down the log messages in the KIP or
> in
> > > the
> > > >> PR discussion. What say you?
> > > >>
> > > >> Thanks,
> > > >> -John
> > > >>
> > > >> On Tue, Apr 3, 2018 at 12:20 AM, Matthias J. Sax <
> > matthias@confluent.io
> > > >
> > > >> wrote:
> > > >>
> > > >>> Thanks for sharing your thoughts. As I mentioned originally, I am
> not
> > > >>> sure about the right log level either. Your arguments are
> convincing
> > --
> > > >>> thus, I am fine with keeping WARN level.
> > > >>>
> > > >>> The task vs thread level argument is an interesting one. However, I
> > am
> > > >>> wondering if we should add this information into the corresponding
> > WARN
> > > >>> logs that we write anyway? For this case, we can also log the
> > > >>> corresponding operator (and other information like topic name etc
> if
> > > >>> needed). WDYT about this?
> > > >>>
> > > >>>
> > > >>> -Matthias
> > > >>>
> > > >>> On 4/2/18 8:31 PM, Guozhang Wang wrote:
> > > >>>> Regarding logging: I'm inclined to keep logging at WARN level
> since
> > > >>> skipped
> > > >>>> records are not expected in normal execution (for all reasons that
> > we
> > > >> are
> > > >>>> aware of), and hence when errors happen users should be alerted
> from
> > > >>>> metrics and look into the log files, so to me if it is really
> > > >> spamming
> > > >>>> the log files it is also a good alert for users. Besides for
> > > >> deserialize
> > > >>>> errors we already log at WARN level for this reason.
> > > >>>>
> > > >>>> Regarding the metrics-levels: I was pondering on that as well.
> What
> > > >> made
> > > >>> me
> > > >>>> to think and agree on task-level than thread-level is that for
> some
> > > >>> reasons
> > > >>>> like window retention, they may possibly be happening on a subset
> of
> > > >>> input
> > > >>>> partitions, and since tasks are correlated with partitions, the
> task-level
> > > >>> metrics
> > > >>>> can help users to narrow down on the specific input data
> partitions.
> > > >>>>
> > > >>>>
> > > >>>> Guozhang
> > > >>>>
> > > >>>>
> > > >>>> On Mon, Apr 2, 2018 at 6:43 PM, John Roesler <jo...@confluent.io>
> > > >> wrote:
> > > >>>>
> > > >>>>> Hi Matthias,
> > > >>>>>
> > > >>>>> No worries! Thanks for the reply.
> > > >>>>>
> > > >>>>> 1) There isn't a connection. I tried using the TopologyTestDriver
> > to
> > > >>> write
> > > >>>>> a quick test exercising the current behavior and discovered that
> > the
> > > >>>>> metrics weren't available. It seemed like they should be, so I
> > tacked
> > > >>> it on
> > > >>>>> to this KIP. If you feel it's inappropriate, I can pull it back
> > out.
> > > >>>>>
> > > >>>>> 2) I was also concerned about that, but I figured it would come
> up
> > in
> > > >>>>> discussion if I just went ahead and proposed it. And here we are!
> > > >>>>>
> > > >>>>> Here's my thought: maybe there are two classes of skips:
> > "controlled"
> > > >>> and
> > > >>>>> "uncontrolled", where "controlled" means, as an app author, I
> > > >>> deliberately
> > > >>>>> filter out some events, and "uncontrolled" means that I simply
> > don't
> > > >>>>> account for some feature of the data, and the framework skips
> them
> > > (as
> > > >>>>> opposed to crashing).
> > > >>>>>
> > > >>>>> In this breakdown, the skips I'm adding metrics for are all
> > > >>> uncontrolled
> > > >>>>> skips (and we hope to measure all the uncontrolled skips). Our
> > skips
> > > >> are
> > > >>>>> well documented, so it wouldn't be terrible to have an
> application
> > in
> > > >>> which
> > > >>>>> you know you expect to have tons of uncontrolled skips, but it's
> > not
> > > >>> great
> > > >>>>> either, since you may also have some *unexpected* uncontrolled
> > skips.
> > > >>> It'll
> > > >>>>> be difficult to notice, since you're probably not alerting on the
> > > >> metric
> > > >>>>> and filtering out the logs (whatever their level).
> > > >>>>>
> > > >>>>> I'd recommend any app author, as an alternative, to convert all
> > > >> expected
> > > >>>>> skips to controlled ones, by updating the topology to filter
> those
> > > >>> records
> > > >>>>> out.
> > > >>>>>
> > > >>>>> Following from my recommendation, as a library author, I'm
> inclined
> > > to
> > > >>> mark
> > > >>>>> those logs WARN, since in my opinion, they should be concerning
> to
> > > the
> > > >>> app
> > > >>>>> authors. I'd definitely want to show, rather than hide, them by
> > > >>> default, so
> > > >>>>> I would pick INFO at least.
> > > >>>>>
> > > >>>>> That said, logging is always a tricky issue for lower-level
> > libraries
> > > >>> that
> > > >>>>> run inside user code, since we don't have all the information we
> > need
> > > >> to
> > > >>>>> make the right call.
> > > >>>>>
> > > >>>>>
> > > >>>>>
> > > >>>>> On your last note, yeah, I got that impression from Guozhang as
> > well.
> > > >>>>> Thanks for the clarification.
> > > >>>>>
> > > >>>>> -John
> > > >>>>>
> > > >>>>>
> > > >>>>>
> > > >>>>> On Mon, Apr 2, 2018 at 4:03 PM, Matthias J. Sax <
> > > >> matthias@confluent.io>
> > > >>>>> wrote:
> > > >>>>>
> > > >>>>>> John,
> > > >>>>>>
> > > >>>>>> sorry for my late reply and thanks for updating the KIP.
> > > >>>>>>
> > > >>>>>> I like your approach about "metrics are for monitoring, logs are
> > for
> > > >>>>>> debugging" -- however:
> > > >>>>>>
> > > >>>>>> 1) I don't see a connection between this and the task-level
> > metrics
> > > >>> that
> > > >>>>>> you propose to get the metrics in `TopologyTestDriver`. I don't
> > > think
> > > >>>>>> people would monitor the `TopologyTestDriver` and thus am wondering
> > why
> > > >> it
> > > >>>>>> is important to include the metrics there? Thread-level metric
> > might
> > > >> be
> > > >>>>>> easier to monitor though (i.e., fewer different metrics to monitor).
> > > >>>>>>
> > > >>>>>> 2) I am a little worried about WARN level logging and that it
> > might
> > > >> be
> > > >>>>>> too chatty -- as you pointed out, it's about debugging, thus
> DEBUG
> > > >>> level
> > > >>>>>> might be better. Not 100% sure about this to be honest. What is
> > the
> > > >>>>>> general assumption about the frequency for skipped records? I
> > could
> > > >>>>>> imagine cases for which skipped records are quite frequent and
> > thus,
> > > >>>>>> WARN level logs might "flood" the logs
> > > >>>>>>
> > > >>>>>> One final remark:
> > > >>>>>>
> > > >>>>>>> More
> > > >>>>>>> generally, I would like to establish a pattern in which we
> could
> > > add
> > > >>>>> new
> > > >>>>>>> values for the "reason" tags without needing a KIP to do so.
> > > >>>>>>
> > > >>>>>> From my understanding, this is not feasible. Changing metrics is
> > > >> always
> > > >>>>>> considered a public API change, and we need a KIP for any
> change.
> > As
> > > >> we
> > > >>>>>> moved away from tagging, it doesn't matter for the KIP anymore
> --
> > > >> just
> > > >>>>>> wanted to point it out.
> > > >>>>>>
> > > >>>>>>
> > > >>>>>> -Matthias
> > > >>>>>>
> > > >>>>>>
> > > >>>>>> On 3/30/18 2:47 PM, John Roesler wrote:
> > > >>>>>>> Allrighty! The KIP is updated.
> > > >>>>>>>
> > > >>>>>>> Thanks again, all, for the feedback.
> > > >>>>>>> -John
> > > >>>>>>>
> > > >>>>>>> On Fri, Mar 30, 2018 at 3:35 PM, John Roesler <
> john@confluent.io
> > >
> > > >>>>> wrote:
> > > >>>>>>>
> > > >>>>>>>> Hey Guozhang and Bill,
> > > >>>>>>>>
> > > >>>>>>>> Ok, I'll update the KIP. At the risk of disturbing consensus,
> > I'd
> > > >>> like
> > > >>>>>> to
> > > >>>>>>>> put it in the task instead of the thread so that it'll show up
> > in
> > > >> the
> > > >>>>>>>> TopologyTestDriver metrics as well.
> > > >>>>>>>>
> > > >>>>>>>> I'm leaning toward keeping the scope where it is right now,
> but
> > if
> > > >>>>>> others
> > > >>>>>>>> want to advocate for tossing in some more metrics, we can go
> > that
> > > >>>>> route.
> > > >>>>>>>>
> > > >>>>>>>> Thanks all,
> > > >>>>>>>> -John
> > > >>>>>>>>
> > > >>>>>>>> On Fri, Mar 30, 2018 at 2:37 PM, Bill Bejeck <
> bbejeck@gmail.com
> > >
> > > >>>>> wrote:
> > > >>>>>>>>
> > > >>>>>>>>> Thanks for the KIP John, and sorry for the late comments.
> > > >>>>>>>>>
> > > >>>>>>>>> I'm on the fence with providing a single level metrics, but I
> > > >> think
> > > >>>>>> we'll
> > > >>>>>>>>> have that discussion outside of this KIP.
> > > >>>>>>>>>
> > > >>>>>>>>>> * maintain one skipped-record metric (could be per-thread,
> > > >>> per-task,
> > > >>>>>> or
> > > >>>>>>>>>> per-processor-node) with no "reason"
> > > >>>>>>>>>> * introduce a warn-level log detailing the
> > > topic/partition/offset
> > > >>>>> and
> > > >>>>>>>>>> reason of the skipped record
> > > >>>>>>>>>
> > > >>>>>>>>> I'm +1 on both of these suggestions.
> > > >>>>>>>>>
> > > >>>>>>>>> Finally, we have had requests in the past for some metrics
> > around
> > > >>>>> when
> > > >>>>>>>>> persistent store removes an expired window.  Would adding
> that
> > to
> > > >>> our
> > > >>>>>>>>> metrics stretch the scope of this KIP too much?
> > > >>>>>>>>>
> > > >>>>>>>>> Thanks again and overall I'm +1 on this KIP
> > > >>>>>>>>>
> > > >>>>>>>>> Bill
> > > >>>>>>>>>
> > > >>>>>>>>> On Fri, Mar 30, 2018 at 2:00 PM, Guozhang Wang <
> > > >> wangguoz@gmail.com>
> > > >>>>>>>>> wrote:
> > > >>>>>>>>>
> > > >>>>>>>>>> The proposal sounds good to me. About "maintain only one
> level
> > > of
> > > >>>>>>>>> metrics"
> > > >>>>>>>>>> maybe we can discuss about that separately from this KIP
> since
> > > >> that
> > > >>>>>>>>> would
> > > >>>>>>>>>> be a larger scope of discussion. I agree that if we are
> going
> > to
> > > >>>>>>>>> maintain
> > > >>>>>>>>>> only one-level metrics it should be lowest level and we
> would
> > > let
> > > >>>>>> users
> > > >>>>>>>>> to
> > > >>>>>>>>>> do the roll-ups themselves, but I'm still not fully
> convinced
> > > >> that
> > > >>>>> we
> > > >>>>>>>>>> should just provide single-level metrics, because 1) I think
> > for
> > > >>>>>>>>> different
> > > >>>>>>>>>> metrics people may be interested to investigate into
> different
> > > >>>>>>>>>> granularities, e.g. for poll / commit rate these are at the
> > > >> lowest
> > > >>>>>>>>>> task-level metrics, while for process-rate / skip-rate they
> > can
> > > >> be
> > > >>>>> as
> > > >>>>>>>>> low
> > > >>>>>>>>>> as processor-node metrics, and 2) user-side rolling ups may
> > not
> > > >> be
> > > >>>>>> very
> > > >>>>>>>>>> straight-forward. But for 2) if someone can provide an
> > efficient
> > > >>> and
> > > >>>>>>>>> easy
> > > >>>>>>>>>> implementation of that I can be persuaded :)
> > > >>>>>>>>>>
> > > >>>>>>>>>> For now I'm thinking we can add the metric on thread-level,
> > > >> either
> > > >>>>>> with
> > > >>>>>>>>>> finer grained ones with "reason" tag plus an aggregated one
> > > >> without
> > > >>>>>> the
> > > >>>>>>>>>> tag, or just having a single aggregated metric without the
> tag
> > > >>> looks
> > > >>>>>>>>> good
> > > >>>>>>>>>> to me.
> > > >>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>> Guozhang
> > > >>>>>>>>>>
> > > >>>>>>>>>> On Fri, Mar 30, 2018 at 8:05 AM, John Roesler <
> > > john@confluent.io
> > > >>>
> > > >>>>>>>>> wrote:
> > > >>>>>>>>>>
> > > >>>>>>>>>>> Hey Guozhang,
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> Thanks for the reply. Regarding JMX, I can dig it. I'll
> > provide
> > > >> a
> > > >>>>>>>>> list in
> > > >>>>>>>>>>> the KIP. I was also thinking we'd better start a
> > documentation
> > > >>> page
> > > >>>>>>>>> with
> > > >>>>>>>>>>> the metrics listed.
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> I'd have no problem logging a warning when we skip records.
> > On
> > > >> the
> > > >>>>>>>>> metric
> > > >>>>>>>>>>> front, really I'm just pushing for us to maintain only one
> > > level
> > > >>> of
> > > >>>>>>>>>>> metrics. If that's more or less granular (i.e., maybe we
> > don't
> > > >>>>> have a
> > > >>>>>>>>>>> metric per reason and log the reason instead), that's fine
> by
> > > >> me.
> > > >>> I
> > > >>>>>>>>> just
> > > >>>>>>>>>>> don't think it provides a lot of extra value per complexity
> > > >>>>>> (interface
> > > >>>>>>>>>> and
> > > >>>>>>>>>>> implementation) to maintain roll-ups at the thread level in
> > > >>>>> addition
> > > >>>>>>>>> to
> > > >>>>>>>>>>> lower-level metrics.
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> How about this instead:
> > > >>>>>>>>>>> * maintain one skipped-record metric (could be per-thread,
> > > >>>>> per-task,
> > > >>>>>>>>> or
> > > >>>>>>>>>>> per-processor-node) with no "reason"
> > > >>>>>>>>>>> * introduce a warn-level log detailing the
> > > >> topic/partition/offset
> > > >>>>> and
> > > >>>>>>>>>>> reason of the skipped record
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> If you like that, I can update the KIP.
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> Thanks,
> > > >>>>>>>>>>> -John
> > > >>>>>>>>>>>
> > > >>>>>>>>>>>
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> On Thu, Mar 29, 2018 at 6:22 PM, Guozhang Wang <
> > > >>> wangguoz@gmail.com
> > > >>>>>>
> > > >>>>>>>>>> wrote:
> > > >>>>>>>>>>>
> > > >>>>>>>>>>>>> One thing you mention is the notion of setting alerts on
> > > >> coarser
> > > >>>>>>>>>>> metrics
> > > >>>>>>>>>>>> being easier than finer ones. All the metric alerting
> > systems
> > > I
> > > >>>>> have
> > > >>>>>>>>>> used
> > > >>>>>>>>>>>> make it equally easy to alert on metrics by-tag or over
> > tags.
> > > >> So
> > > >>>>> my
> > > >>>>>>>>>>>> experience doesn't say that this is a use case. Were you
> > > >> thinking
> > > >>>>>>>>> of an
> > > >>>>>>>>>>>> alerting system that makes such a pre-aggregation
> valuable?
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> For the commonly used JMX reporter tags will be encoded
> > > >> directly
> > > >>>>> as
> > > >>>>>>>>>> part
> > > >>>>>>>>>>> of
> > > >>>>>>>>>>>> the object name, and if users wants to monitor them they
> > need
> > > >> to
> > > >>>>>>>>> know
> > > >>>>>>>>>>> these
> > > >>>>>>>>>>>> values before hand. That is also why I think we do want to
> > > list
> > > >>>>> all
> > > >>>>>>>>> the
> > > >>>>>>>>>>>> possible values of the reason tags in the KIP, since
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>>> In my email in response to Matthias, I gave an example of
> > the
> > > >>>>>>>>> kind of
> > > >>>>>>>>>>>> scenario that would lead me as an operator to run with
> DEBUG
> > > on
> > > >>>>> all
> > > >>>>>>>>> the
> > > >>>>>>>>>>>> time, since I wouldn't be sure, having seen a skipped
> record
> > > >>> once,
> > > >>>>>>>>> that
> > > >>>>>>>>>>> it
> > > >>>>>>>>>>>> would ever happen again. The solution is to capture all
> the
> > > >>>>>>>>> available
> > > >>>>>>>>>>>> information about the reason and location of skips all the
> > > >> time.
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> That is a good point. I think we can either expose all
> > levels
> > > >>>>>>>>> metrics
> > > >>>>>>>>>> as
> > > >>>>>>>>>>> by
> > > >>>>>>>>>>>> default, or only expose the most lower-level metrics and
> get
> > > >> rid
> > > >>>>> of
> > > >>>>>>>>>> other
> > > >>>>>>>>>>>> levels to let users do roll-ups themselves (which will be
> a
> > > >> much
> > > >>>>>>>>> larger
> > > >>>>>>>>>>>> scope for discussion), or we can encourage users to not
> > purely
> > > >>>>>>>>> depend
> > > >>>>>>>>>> on
> > > >>>>>>>>>>>> metrics for such trouble shooting: that is to say, users
> > only
> > > >> be
> > > >>>>>>>>>> alerted
> > > >>>>>>>>>>>> based on metrics, and we can log a info / warn log4j entry
> > > each
> > > >>>>>>>>> time we
> > > >>>>>>>>>>> are
> > > >>>>>>>>>>>> about to skip a record all over the places, so that upon
> > being
> > > >>>>>>>>> notified
> > > >>>>>>>>>>>> users can look into the logs to find the details on where
> /
> > > >> when
> > > >>>>> it
> > > >>>>>>>>>>>> happens. WDYT?
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> Guozhang
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> On Thu, Mar 29, 2018 at 3:57 PM, John Roesler <
> > > >> john@confluent.io
> > > >>>>
> > > >>>>>>>>>> wrote:
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>>> Hey Guozhang,
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> Thanks for the review.
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> 1.
> > > >>>>>>>>>>>>> Matthias raised the same question about the "reason" tag
> > > >> values.
> > > >>>>> I
> > > >>>>>>>>>> can
> > > >>>>>>>>>>>> list
> > > >>>>>>>>>>>>> all possible values of the "reason" tag, but I'm thinking
> > > this
> > > >>>>>>>>> level
> > > >>>>>>>>>> of
> > > >>>>>>>>>>>>> detail may not be KIP-worthy, maybe the code and
> > > documentation
> > > >>>>>>>>> review
> > > >>>>>>>>>>>> would
> > > >>>>>>>>>>>>> be sufficient. If you all disagree and would like it
> > included
> > > >> in
> > > >>>>>>>>> the
> > > >>>>>>>>>>>> KIP, I
> > > >>>>>>>>>>>>> can certainly do that.
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> If we do provide roll-up metrics, I agree with the
> pattern
> > of
> > > >>>>>>>>> keeping
> > > >>>>>>>>>>> the
> > > >>>>>>>>>>>>> same name but eliminating the tags for the dimensions
> that
> > > >> were
> > > >>>>>>>>>>>> rolled-up.
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> 2.
> > > >>>>>>>>>>>>> I'm not too sure that implementation efficiency really
> > > >> becomes a
> > > >>>>>>>>>> factor
> > > >>>>>>>>>>>> in
> > > >>>>>>>>>>>>> choosing whether to (by default) update one coarse metric
> > at
> > > >> the
> > > >>>>>>>>>> thread
> > > >>>>>>>>>>>>> level or one granular metric at the processor-node level,
> > > >> since
> > > >>>>>>>>> it's
> > > >>>>>>>>>>> just
> > > >>>>>>>>>>>>> one metric being updated either way. I do agree that if
> we
> > > >> were
> > > >>>>> to
> > > >>>>>>>>>>> update
> > > >>>>>>>>>>>>> the granular metrics and multiple roll-ups, then we
> should
> > > >>>>>>>>> consider
> > > >>>>>>>>>> the
> > > >>>>>>>>>>>>> efficiency.
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> I agree it's probably not necessary to surface the
> metrics
> > > for
> > > >>>>> all
> > > >>>>>>>>>>> nodes
> > > >>>>>>>>>>>>> regardless of whether they can or do skip records.
> Perhaps
> > we
> > > >>> can
> > > >>>>>>>>>>> lazily
> > > >>>>>>>>>>>>> register the metrics.
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> In my email in response to Matthias, I gave an example of
> > the
> > > >>>>>>>>> kind of
> > > >>>>>>>>>>>>> scenario that would lead me as an operator to run with
> > DEBUG
> > > >> on
> > > >>>>>>>>> all
> > > >>>>>>>>>> the
> > > >>>>>>>>>>>>> time, since I wouldn't be sure, having seen a skipped
> > record
> > > >>>>> once,
> > > >>>>>>>>>> that
> > > >>>>>>>>>>>> it
> > > >>>>>>>>>>>>> would ever happen again. The solution is to capture all
> the
> > > >>>>>>>>> available
> > > >>>>>>>>>>>>> information about the reason and location of skips all
> the
> > > >> time.
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> One thing you mention is the notion of setting alerts on
> > > >> coarser
> > > >>>>>>>>>>> metrics
> > > >>>>>>>>>>>>> being easier than finer ones. All the metric alerting
> > systems
> > > >> I
> > > >>>>>>>>> have
> > > >>>>>>>>>>> used
> > > >>>>>>>>>>>>> make it equally easy to alert on metrics by-tag or over
> > tags.
> > > >> So
> > > >>>>>>>>> my
> > > >>>>>>>>>>>>> experience doesn't say that this is a use case. Were you
> > > >>> thinking
> > > >>>>>>>>> of
> > > >>>>>>>>>> an
> > > >>>>>>>>>>>>> alerting system that makes such a pre-aggregation
> valuable?
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> Thanks again,
> > > >>>>>>>>>>>>> -John
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> On Thu, Mar 29, 2018 at 5:24 PM, Guozhang Wang <
> > > >>>>>>>>> wangguoz@gmail.com>
> > > >>>>>>>>>>>> wrote:
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> Hello John,
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> Thanks for the KIP. Some comments:
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> 1. Could you list all the possible values of the
> "reason"
> > > >> tag?
> > > >>>>>>>>> In
> > > >>>>>>>>>> the
> > > >>>>>>>>>>>>> JIRA
> > > >>>>>>>>>>>>>> ticket I left some potential reasons but I'm not clear
> if
> > > >>> you're
> > > >>>>>>>>>>> going
> > > >>>>>>>>>>>> to
> > > >>>>>>>>>>>>>> categorize each of them as a separate reason, or is
> there
> > > any
> > > >>>>>>>>>>>> additional
> > > >>>>>>>>>>>>>> ones you have in mind.
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> Also I'm wondering if we should add another metric that
> do
> > > >> not
> > > >>>>>>>>> have
> > > >>>>>>>>>>> the
> > > >>>>>>>>>>>>>> reason tag but aggregates among all possible reasons?
> This
> > > is
> > > >>>>>>>>> for
> > > >>>>>>>>>>> users
> > > >>>>>>>>>>>>> to
> > > >>>>>>>>>>>>>> easily set their alerting notifications (otherwise they
> > have
> > > >> to
> > > >>>>>>>>>> write
> > > >>>>>>>>>>>> on
> > > >>>>>>>>>>>>>> notification rule per reason) in their monitoring
> systems.
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> 2. Note that the processor-node metrics is actually
> > > >>> "per-thread,
> > > >>>>>>>>>>>>> per-task,
> > > >>>>>>>>>>>>>> per-processor-node", and today we only set the
> per-thread
> > > >>>>>>>>> metrics
> > > >>>>>>>>>> as
> > > >>>>>>>>>>>> INFO
> > > >>>>>>>>>>>>>> while leaving the lower two layers as DEBUG. I agree
> with
> > > >> your
> > > >>>>>>>>>>> argument
> > > >>>>>>>>>>>>>> that we are missing the per-client roll-up metrics
> today,
> > > but
> > > >>>>>>>>> I'm
> > > >>>>>>>>>>>>> convinced
> > > >>>>>>>>>>>>>> that the right way to approach it would be
> > > >>>>>>>>>>> "just-providing-the-lowest-
> > > >>>>>>>>>>>>>> level
> > > >>>>>>>>>>>>>> metrics only".
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> Note the recoding implementation of these three levels
> are
> > > >>>>>>>>>> different
> > > >>>>>>>>>>>>>> internally today: we did not just do the rolling up to
> > > >> generate
> > > >>>>>>>>> the
> > > >>>>>>>>>>>>>> higher-level metrics from the lower level ones, but we
> > just
> > > >>>>>>>>> record
> > > >>>>>>>>>>> them
> > > >>>>>>>>>>>>>> separately, which means that, if we turn on multiple
> > levels
> > > >> of
> > > >>>>>>>>>>> metrics,
> > > >>>>>>>>>>>>> we
> > > >>>>>>>>>>>>>> maybe duplicate collecting some metrics. One can argue
> > that
> > > >> is
> > > >>>>>>>>> not
> > > >>>>>>>>>>> the
> > > >>>>>>>>>>>>> best
> > > >>>>>>>>>>>>>> way to represent multi-level metrics collecting and
> > > >> reporting,
> > > >>>>>>>>> but
> > > >>>>>>>>>> by
> > > >>>>>>>>>>>>> only
> > > >>>>>>>>>>>>>> enabling thread-level metrics as INFO today, that
> > > >>> implementation
> > > >>>>>>>>>>> could
> > > >>>>>>>>>>>> be
> > > >>>>>>>>>>>>>> more efficient than only collecting the metrics at the
> > > lowest
> > > >>>>>>>>>> level,
> > > >>>>>>>>>>>> and
> > > >>>>>>>>>>>>>> then do the roll-up calculations outside of the metrics
> > > >>> classes.
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> Plus, today not all processor-nodes may possibly skip
> > > >> records,
> > > >>>>>>>>>> AFAIK
> > > >>>>>>>>>>> we
> > > >>>>>>>>>>>>>> will only skip records at the source, sink, window and
> > > >>>>>>>>> aggregation
> > > >>>>>>>>>>>>>> processor nodes, so adding a metric per processor looks
> > like
> > > >> an
> > > >>>>>>>>>>>> overkill
> > > >>>>>>>>>>>>> to
> > > >>>>>>>>>>>>>> me as well. On the other hand, from user's perspective
> the
> > > >>>>>>>>> "reason"
> > > >>>>>>>>>>> tag
> > > >>>>>>>>>>>>> may
> > > >>>>>>>>>>>>>> be sufficient for them to narrow down where inside the
> > > >> topology
> > > >>>>>>>>> is
> > > >>>>>>>>>>>>> causing
> > > >>>>>>>>>>>>>> records to be dropped on the floor. So I think the
> > > >> "per-thread,
> > > >>>>>>>>>>>> per-task"
> > > >>>>>>>>>>>>>> level metrics should be sufficient for them in trouble
> > shoot
> > > >> in
> > > >>>>>>>>>> DEBUG
> > > >>>>>>>>>>>>> mode,
> > > >>>>>>>>>>>>>> and we can add another "per-thread" level metrics as
> INFO
> > > >> which
> > > >>>>>>>>> is
> > > >>>>>>>>>>>> turned
> > > >>>>>>>>>>>>>> on by default. So under normal execution users still
> only
> > > >> need
> > > >>>>>>>>> INFO
> > > >>>>>>>>>>>> level
> > > >>>>>>>>>>>>>> metrics for alerting (e.g. set alerts on all
> > skipped-records
> > > >>>>>>>>>> metrics
> > > >>>>>>>>>>> as
> > > >>>>>>>>>>>>>> non-zero), and then upon trouble shooting they can turn
> on
> > > >>> DEBUG
> > > >>>>>>>>>>>> metrics
> > > >>>>>>>>>>>>> to
> > > >>>>>>>>>>>>>> look into which task is actually causing the skipped
> > > records.
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> Guozhang
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> On Thu, Mar 29, 2018 at 2:03 PM, Matthias J. Sax <
> > > >>>>>>>>>>>> matthias@confluent.io>
> > > >>>>>>>>>>>>>> wrote:
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> Thanks for the KIP John.
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> Reading the material on the related Jiras, I am
> wondering
> > > >> what
> > > >>>>>>>>>>>> `reason`
> > > >>>>>>>>>>>>>>> tags you want to introduce? Can you elaborate? The KIP
> > > >> should
> > > >>>>>>>>>> list
> > > >>>>>>>>>>>>> those
> > > >>>>>>>>>>>>>>> IMHO.
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> About the fine grained metrics vs the roll-up: you say
> > that
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> the coarse metric aggregates across two dimensions
> > > >>>>>>>>>> simultaneously
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> Can you elaborate why this is an issue? I am not
> > convinced
> > > >> atm
> > > >>>>>>>>>> that
> > > >>>>>>>>>>>> we
> > > >>>>>>>>>>>>>>> should put the fine grained metrics into INFO level and
> > > >> remove
> > > >>>>>>>>>> the
> > > >>>>>>>>>>>>>>> roll-up at thread level.
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> Given that they have to do this sum to get a usable
> > > >>>>>>>>> top-level
> > > >>>>>>>>>>> view
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> This is a fair concern, but I don't share the
> conclusion.
> > > >>>>>>>>>> Offering
> > > >>>>>>>>>>> a
> > > >>>>>>>>>>>>>>> built-in `KafkaStreams` "client" roll-up out of the box
> > > >> might
> > > >>>>>>>>> be
> > > >>>>>>>>>> a
> > > >>>>>>>>>>>>>>> better solution. In the past we did not offer this due
> to
> > > >>>>>>>>>>> performance
> > > >>>>>>>>>>>>>>> concerns, but we could allow an "opt-in" mechanism. If
> > you
> > > >>>>>>>>>>> disagree,
> > > >>>>>>>>>>>>> can
> > > >>>>>>>>>>>>>>> you provide some reasoning and add them to the
> "Rejected
> > > >>>>>>>>>>>> alternatives"
> > > >>>>>>>>>>>>>>> section.
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> To rephrase: I understand the issue about missing
> > top-level
> > > >>>>>>>>> view,
> > > >>>>>>>>>>> but
> > > >>>>>>>>>>>>>>> instead of going more fine grained, we should consider
> to
> > > >> add
> > > >>>>>>>>>> this
> > > >>>>>>>>>>>>>>> top-level view and add/keep the fine grained metrics at
> > > >> DEBUG
> > > >>>>>>>>>> level
> > > >>>>>>>>>>>>> only
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> I am +1 to add TopologyTestDriver#metrics() and to
> remove
> > > >> old
> > > >>>>>>>>>>> metrics
> > > >>>>>>>>>>>>>>> directly as you suggested.
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> -Matthias
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> On 3/28/18 6:42 PM, Ted Yu wrote:
> > > >>>>>>>>>>>>>>>> Looks good to me.
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> On Wed, Mar 28, 2018 at 3:11 PM, John Roesler <
> > > >>>>>>>>>> john@confluent.io
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>>>> wrote:
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> Hello all,
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> I am proposing KIP-274 to improve the metrics around
> > > >>>>>>>>> skipped
> > > >>>>>>>>>>>> records
> > > >>>>>>>>>>>>>> in
> > > >>>>>>>>>>>>>>>>> Streams.
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> Please find the details here:
> > > >>>>>>>>>>>>>>>>> https://cwiki.apache.org/
> confluence/display/KAFKA/KIP-
> > > >>>>>>>>>>>>>>>>> 274%3A+Kafka+Streams+Skipped+Records+Metrics
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> Please let me know what you think!
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> Thanks,
> > > >>>>>>>>>>>>>>>>> -John
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> --
> > > >>>>>>>>>>>>>> -- Guozhang
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> --
> > > >>>>>>>>>>>> -- Guozhang
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>> --
> > > >>>>>>>>>> -- Guozhang
> > > >>>>>>>>>>
> > > >>>>>>>>>
> > > >>>>>>>>
> > > >>>>>>>>
> > > >>>>>>>
> > > >>>>>>
> > > >>>>>>
> > > >>>>>
> > > >>>>
> > > >>>>
> > > >>>>
> > > >>>
> > > >>>
> > > >>
> > > >
> > > >
> > > >
> > >
> > >
> >
> >
> > --
> > -- Guozhang
> >
>

Re: [DISCUSS] KIP-274: Kafka Streams Skipped Records Metrics

Posted by John Roesler <jo...@confluent.io>.
Allrighty, how about this, then...

I'll move the metric back to the StreamThread and maintain the existing tag
(client-id=...(per-thread client-id)). It won't be present in the
TopologyTestDriver's metrics.
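
To make the trade-off concrete: with only a thread-level metric, an operator who wants an application-level view has to roll the per-thread values up themselves. Here's a rough sketch of what that roll-up looks like (the thread names, values, and the metric name "skipped-records-total" are all made up for illustration; in practice the per-thread values would come from KafkaStreams#metrics()):

```java
import java.util.HashMap;
import java.util.Map;

public class SkipRollup {

    // Sum per-thread skip counts into one application-level total.
    static double total(Map<String, Double> perThread) {
        return perThread.values().stream().mapToDouble(Double::doubleValue).sum();
    }

    public static void main(String[] args) {
        // Hypothetical per-thread values, keyed by the client-id tag, as an
        // operator might collect them for "skipped-records-total".
        Map<String, Double> perThread = new HashMap<>();
        perThread.put("app-1-StreamThread-1", 3.0);
        perThread.put("app-1-StreamThread-2", 0.0);
        perThread.put("app-1-StreamThread-3", 7.0);

        System.out.println(total(perThread)); // prints 10.0
    }
}
```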

As a side note, I'm not sure that the code at the point where we log has
visibility into the name of the thread or the task, or the processor node,
for that matter. But at the end of the day, I don't think it really matters.

None of those identifiers are part of the public interface or user-controlled.
For them to be useful for debugging, users would have to gain a very deep
understanding of how their DSL program gets executed. From my perspective,
they are included in metric tags only to prevent collisions between the
same metrics in different scopes (e.g., different threads).

I think what's important is to provide enough information in the logs that
users will be able to debug their issues. This is why the logs in my
PR include the topic/partition/offset of the offending record, as well as the
stack trace of the exception from the deserializer (or, for timestamps, the
extracted timestamp and the class name of the extractor). This
information alone should let them pinpoint the offending data and fix it.

(I am aware that the topic name might be a repartition topic, and
therefore also esoteric from the user's perspective, but I think it's the
best we can do right now. It might be nice to explicitly take on a
debugging-ergonomics task in the future and give all processor nodes
human-friendly names. Then we could surface those names in any logs or
exceptions. But I'm inclined to call this out of scope for now.)

Thanks again,
-John

On Tue, Apr 3, 2018 at 1:40 PM, Guozhang Wang <wa...@gmail.com> wrote:

> 1. If we can indeed gather all the context information from the log4j
> entries I'd suggest we change to thread-level (I'm not sure if that is
> doable, so if John have already some WIP PR that can help us decide).
>
> 2. We can consider adding the API in TopologyTestDriver for general testing
> purposes; that being said, I think Matthias has a good point that this
> alone should not be a driving motivation for us to keep this metric as
> task-level if 1) is true.
>
>
>
> Guozhang
>
>
> On Tue, Apr 3, 2018 at 11:36 AM, Matthias J. Sax <ma...@confluent.io>
> wrote:
>
> > Thanks Guozhang, that was my intent.
> >
> > @John: yes, we should not nail down the exact log message. It's just to
> > point out the trade-off. If we can get the required information in the
> > logs, we might not need task level metrics.
> >
> >
> > -Matthias
> >
> > On 4/3/18 11:26 AM, Guozhang Wang wrote:
> > > I think Matthias' comment is that, we can still record the metrics on
> the
> > > thread-level, while having the WARN log entry to include sufficient
> > context
> > > information so that users can still easily narrow down the
> investigation
> > > scope.
> > >
> > >
> > > Guozhang
> > >
> > > On Tue, Apr 3, 2018 at 11:22 AM, John Roesler <jo...@confluent.io>
> wrote:
> > >
> > >> I agree we should add as much information as is reasonable to the log.
> > For
> > >> example, see this WIP PR I started for this KIP:
> > >>
> > >> https://github.com/apache/kafka/pull/4812/files#diff-
> > >> 88d129f048bc842c7db5b2566a45fce8R80
> > >>
> > >> and
> > >>
> > >> https://github.com/apache/kafka/pull/4812/files#diff-
> > >> 69e6789eb675ec978a1abd24fed96eb1R111
> > >>
> > >> I'm not sure if we should nail down the log messages in the KIP or in
> > the
> > >> PR discussion. What say you?
> > >>
> > >> Thanks,
> > >> -John
> > >>
> > >> On Tue, Apr 3, 2018 at 12:20 AM, Matthias J. Sax <
> matthias@confluent.io
> > >
> > >> wrote:
> > >>
> > >>> Thanks for sharing your thoughts. As I mentioned originally, I am not
> > >>> sure about the right log level either. Your arguments are convincing
> --
> > >>> thus, I am fine with keeping WARN level.
> > >>>
> > >>> The task vs thread level argument is an interesting one. However, I
> am
> > >>> wondering if we should add this information into the corresponding
> WARN
> > >>> logs that we write anyway? For this case, we can also log the
> > >>> corresponding operator (and other information like topic name etc if
> > >>> needed). WDYT about this?
> > >>>
> > >>>
> > >>> -Matthias
> > >>>
> > >>> On 4/2/18 8:31 PM, Guozhang Wang wrote:
> > >>>> Regarding logging: I'm inclined to keep logging at WARN level since
> > >>> skipped
> > >>>> records are not expected in normal execution (for all reasons that
> we
> > >> are
> > >>>> aware of), and hence when error happens users should be alerted from
> > >>>> metrics and looked into the log files, so to me if it is really
> > >> spamming
> > >>>> the log files it is also a good alert for users. Besides for
> > >> deserialize
> > >>>> errors we already log at WARN level for this reason.
> > >>>>
> > >>>> Regarding the metrics-levels: I was pondering on that as well. What
> > >> made
> > >>> me
> > >>>> to think and agree on task-level than thread-level is that for some
> > >>> reasons
> > >>>> like window retention, they may possibly be happening on a subset of
> > >>> input
> > >>>> partitions, and tasks are correlated with partitions the task-level
> > >>> metrics
> > >>>> can help users to narrow down on the specific input data partitions.
> > >>>>
> > >>>>
> > >>>> Guozhang
> > >>>>
> > >>>>
> > >>>> On Mon, Apr 2, 2018 at 6:43 PM, John Roesler <jo...@confluent.io>
> > >> wrote:
> > >>>>
> > >>>>> Hi Matthias,
> > >>>>>
> > >>>>> No worries! Thanks for the reply.
> > >>>>>
> > >>>>> 1) There isn't a connection. I tried using the TopologyTestDriver
> to
> > >>> write
> > >>>>> a quick test exercising the current behavior and discovered that
> the
> > >>>>> metrics weren't available. It seemed like they should be, so I
> tacked
> > >>> it on
> > >>>>> to this KIP. If you feel it's inappropriate, I can pull it back
> out.
> > >>>>>
> > >>>>> 2) I was also concerned about that, but I figured it would come up
> in
> > >>>>> discussion if I just went ahead and proposed it. And here we are!
> > >>>>>
> > >>>>> Here's my thought: maybe there are two classes of skips:
> "controlled"
> > >>> and
> > >>>>> "uncontrolled", where "controlled" means, as an app author, I
> > >>> deliberately
> > >>>>> filter out some events, and "uncontrolled" means that I simply
> don't
> > >>>>> account for some feature of the data, and the framework skips them
> > (as
> > >>>>> opposed to crashing).
> > >>>>>
> > >>>>> In this breakdowns, the skips I'm adding metrics for are all
> > >>> uncontrolled
> > >>>>> skips (and we hope to measure all the uncontrolled skips). Our
> skips
> > >> are
> > >>>>> well documented, so it wouldn't be terrible to have an application
> in
> > >>> which
> > >>>>> you know you expect to have tons of uncontrolled skips, but it's
> not
> > >>> great
> > >>>>> either, since you may also have some *unexpected* uncontrolled
> skips.
> > >>> It'll
> > >>>>> be difficult to notice, since you're probably not alerting on the
> > >> metric
> > >>>>> and filtering out the logs (whatever their level).
> > >>>>>
> > >>>>> I'd recommend any app author, as an alternative, to convert all
> > >> expected
> > >>>>> skips to controlled ones, by updating the topology to filter those
> > >>> records
> > >>>>> out.
> > >>>>>
> > >>>>> Following from my recommendation, as a library author, I'm inclined
> > to
> > >>> mark
> > >>>>> those logs WARN, since in my opinion, they should be concerning to
> > the
> > >>> app
> > >>>>> authors. I'd definitely want to show, rather than hide, them by
> > >>> default, so
> > >>>>> I would pick INFO at least.
> > >>>>>
> > >>>>> That said, logging is always a tricky issue for lower-level
> libraries
> > >>> that
> > >>>>> run inside user code, since we don't have all the information we
> need
> > >> to
> > >>>>> make the right call.
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>> On your last note, yeah, I got that impression from Guozhang as
> well.
> > >>>>> Thanks for the clarification.
> > >>>>>
> > >>>>> -John
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>> On Mon, Apr 2, 2018 at 4:03 PM, Matthias J. Sax <
> > >> matthias@confluent.io>
> > >>>>> wrote:
> > >>>>>
> > >>>>>> John,
> > >>>>>>
> > >>>>>> sorry for my late reply and thanks for updating the KIP.
> > >>>>>>
> > >>>>>> I like your approach about "metrics are for monitoring, logs are
> for
> > >>>>>> debugging" -- however:
> > >>>>>>
> > >>>>>> 1) I don't see a connection between this and the task-level
> metrics
> > >>> that
> > >>>>>> you propose to get the metrics in `TopologyTestDriver`. I don't
> > think
> > >>>>>> people would monitor the `TopologyTestDriver` an thus wondering
> why
> > >> it
> > >>>>>> is important to include the metrics there? Thread-level metric
> might
> > >> be
> > >>>>>> easier to monitor though (ie, less different metric to monitor).
> > >>>>>>
> > >>>>>> 2) I am a little worried about WARN level logging and that it
> might
> > >> be
> > >>>>>> too chatty -- as you pointed out, it's about debugging, thus DEBUG
> > >>> level
> > >>>>>> might be better. Not 100% sure about this to be honest. What is
> the
> > >>>>>> general assumption about the frequency for skipped records? I
> could
> > >>>>>> imagine cases for which skipped records are quite frequent and
> thus,
> > >>>>>> WARN level logs might "flood" the logs
> > >>>>>>
> > >>>>>> One final remark:
> > >>>>>>
> > >>>>>>> More
> > >>>>>>> generally, I would like to establish a pattern in which we could
> > add
> > >>>>> new
> > >>>>>>> values for the "reason" tags without needing a KIP to do so.
> > >>>>>>
> > >>>>>> From my understanding, this is not feasible. Changing metrics is
> > >> always
> > >>>>>> considered a public API change, and we need a KIP for any change.
> As
> > >> we
> > >>>>>> moved away from tagging, it doesn't matter for the KIP anymore --
> > >> just
> > >>>>>> wanted to point it out.
> > >>>>>>
> > >>>>>>
> > >>>>>> -Matthias
> > >>>>>>
> > >>>>>>
> > >>>>>> On 3/30/18 2:47 PM, John Roesler wrote:
> > >>>>>>> Allrighty! The KIP is updated.
> > >>>>>>>
> > >>>>>>> Thanks again, all, for the feedback.
> > >>>>>>> -John
> > >>>>>>>
> > >>>>>>> On Fri, Mar 30, 2018 at 3:35 PM, John Roesler <john@confluent.io
> >
> > >>>>> wrote:
> > >>>>>>>
> > >>>>>>>> Hey Guozhang and Bill,
> > >>>>>>>>
> > >>>>>>>> Ok, I'll update the KIP. At the risk of disturbing consensus,
> I'd
> > >>> like
> > >>>>>> to
> > >>>>>>>> put it in the task instead of the thread so that it'll show up
> in
> > >> the
> > >>>>>>>> TopologyTestDriver metrics as well.
> > >>>>>>>>
> > >>>>>>>> I'm leaning toward keeping the scope where it is right now, but
> if
> > >>>>>> others
> > >>>>>>>> want to advocate for tossing in some more metrics, we can go
> that
> > >>>>> route.
> > >>>>>>>>
> > >>>>>>>> Thanks all,
> > >>>>>>>> -John
> > >>>>>>>>
> > >>>>>>>> On Fri, Mar 30, 2018 at 2:37 PM, Bill Bejeck <bbejeck@gmail.com
> >
> > >>>>> wrote:
> > >>>>>>>>
> > >>>>>>>>> Thanks for the KIP John, and sorry for the late comments.
> > >>>>>>>>>
> > >>>>>>>>> I'm on the fence with providing a single level metrics, but I
> > >> think
> > >>>>>> we'll
> > >>>>>>>>> have that discussion outside of this KIP.
> > >>>>>>>>>
> > >>>>>>>>>> * maintain one skipped-record metric (could be per-thread,
> > >>> per-task,
> > >>>>>> or
> > >>>>>>>>>> per-processor-node) with no "reason"
> > >>>>>>>>>> * introduce a warn-level log detailing the
> > topic/partition/offset
> > >>>>> and
> > >>>>>>>>>> reason of the skipped record
> > >>>>>>>>>
> > >>>>>>>>> I'm +1 on both of these suggestions.
> > >>>>>>>>>
> > >>>>>>>>> Finally, we have had requests in the past for some metrics
> around
> > >>>>> when
> > >>>>>>>>> persistent store removes an expired window.  Would adding that
> to
> > >>> our
> > >>>>>>>>> metrics stretch the scope of this KIP too much?
> > >>>>>>>>>
> > >>>>>>>>> Thanks again and overall I'm +1 on this KIP
> > >>>>>>>>>
> > >>>>>>>>> Bill
> > >>>>>>>>>
> > >>>>>>>>> On Fri, Mar 30, 2018 at 2:00 PM, Guozhang Wang <
> > >> wangguoz@gmail.com>
> > >>>>>>>>> wrote:
> > >>>>>>>>>
> > >>>>>>>>>> The proposal sounds good to me. About "maintain only one level
> > of
> > >>>>>>>>> metrics"
> > >>>>>>>>>> maybe we can discuss about that separately from this KIP since
> > >> that
> > >>>>>>>>> would
> > >>>>>>>>>> be a larger scope of discussion. I agree that if we are going
> to
> > >>>>>>>>> maintain
> > >>>>>>>>>> only one-level metrics it should be lowest level and we would
> > let
> > >>>>>> users
> > >>>>>>>>> to
> > >>>>>>>>>> do the roll-ups themselves, but I'm still not fully convinced
> > >> that
> > >>>>> we
> > >>>>>>>>>> should just provide single-level metrics, because 1) I think
> for
> > >>>>>>>>> different
> > >>>>>>>>>> metrics people may be interested to investigate into different
> > >>>>>>>>>> granularities, e.g. for poll / commit rate these are at the
> > >> lowest
> > >>>>>>>>>> task-level metrics, while for process-rate / skip-rate they
> can
> > >> be
> > >>>>> as
> > >>>>>>>>> low
> > >>>>>>>>>> as processor-node metrics, and 2) user-side rolling ups may
> not
> > >> be
> > >>>>>> very
> > >>>>>>>>>> straight-forward. But for 2) if someone can provide an
> efficient
> > >>> and
> > >>>>>>>>> easy
> > >>>>>>>>>> implementation of that I can be persuaded :)
> > >>>>>>>>>>
> > >>>>>>>>>> For now I'm thinking we can add the metric on thread-level,
> > >> either
> > >>>>>> with
> > >>>>>>>>>> finer grained ones with "reason" tag plus an aggregated one
> > >> without
> > >>>>>> the
> > >>>>>>>>>> tag, or just having a single aggregated metric without the tag
> > >>> looks
> > >>>>>>>>> good
> > >>>>>>>>>> to me.
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>> Guozhang
> > >>>>>>>>>>
> > >>>>>>>>>> On Fri, Mar 30, 2018 at 8:05 AM, John Roesler <
> > john@confluent.io
> > >>>
> > >>>>>>>>> wrote:
> > >>>>>>>>>>
> > >>>>>>>>>>> Hey Guozhang,
> > >>>>>>>>>>>
> > >>>>>>>>>>> Thanks for the reply. Regarding JMX, I can dig it. I'll
> provide
> > >> a
> > >>>>>>>>> list in
> > >>>>>>>>>>> the KIP. I was also thinking we'd better start a
> documentation
> > >>> page
> > >>>>>>>>> with
> > >>>>>>>>>>> the metrics listed.
> > >>>>>>>>>>>
> > >>>>>>>>>>> I'd have no problem logging a warning when we skip records.
> On
> > >> the
> > >>>>>>>>> metric
> > >>>>>>>>>>> front, really I'm just pushing for us to maintain only one
> > level
> > >>> of
> > >>>>>>>>>>> metrics. If that's more or less granular (i.e., maybe we
> don't
> > >>>>> have a
> > >>>>>>>>>>> metric per reason and log the reason instead), that's fine by
> > >> me.
> > >>> I
> > >>>>>>>>> just
> > >>>>>>>>>>> don't think it provides a lot of extra value per complexity
> > >>>>>> (interface
> > >>>>>>>>>> and
> > >>>>>>>>>>> implementation) to maintain roll-ups at the thread level in
> > >>>>> addition
> > >>>>>>>>> to
> > >>>>>>>>>>> lower-level metrics.
> > >>>>>>>>>>>
> > >>>>>>>>>>> How about this instead:
> > >>>>>>>>>>> * maintain one skipped-record metric (could be per-thread,
> > >>>>> per-task,
> > >>>>>>>>> or
> > >>>>>>>>>>> per-processor-node) with no "reason"
> > >>>>>>>>>>> * introduce a warn-level log detailing the
> > >> topic/partition/offset
> > >>>>> and
> > >>>>>>>>>>> reason of the skipped record
> > >>>>>>>>>>>
> > >>>>>>>>>>> If you like that, I can update the KIP.
> > >>>>>>>>>>>
> > >>>>>>>>>>> Thanks,
> > >>>>>>>>>>> -John
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>> On Thu, Mar 29, 2018 at 6:22 PM, Guozhang Wang <
> > >>> wangguoz@gmail.com
> > >>>>>>
> > >>>>>>>>>> wrote:
> > >>>>>>>>>>>
> > >>>>>>>>>>>>> One thing you mention is the notion of setting alerts on
> > >> coarser
> > >>>>>>>>>>> metrics
> > >>>>>>>>>>>> being easier than finer ones. All the metric alerting
> systems
> > I
> > >>>>> have
> > >>>>>>>>>> used
> > >>>>>>>>>>>> make it equally easy to alert on metrics by-tag or over
> tags.
> > >> So
> > >>>>> my
> > >>>>>>>>>>>> experience doesn't say that this is a use case. Were you
> > >> thinking
> > >>>>>>>>> of an
> > >>>>>>>>>>>> alerting system that makes such a pre-aggregation valuable?
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> For the commonly used JMX reporter tags will be encoded
> > >> directly
> > >>>>> as
> > >>>>>>>>>> part
> > >>>>>>>>>>> of
> > >>>>>>>>>>>> the object name, and if users wants to monitor them they
> need
> > >> to
> > >>>>>>>>> know
> > >>>>>>>>>>> these
> > >>>>>>>>>>>> values before hand. That is also why I think we do want to
> > list
> > >>>>> all
> > >>>>>>>>> the
> > >>>>>>>>>>>> possible values of the reason tags in the KIP, since
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>> In my email in response to Matthias, I gave an example of
> the
> > >>>>>>>>> kind of
> > >>>>>>>>>>>> scenario that would lead me as an operator to run with DEBUG
> > on
> > >>>>> all
> > >>>>>>>>> the
> > >>>>>>>>>>>> time, since I wouldn't be sure, having seen a skipped record
> > >>> once,
> > >>>>>>>>> that
> > >>>>>>>>>>> it
> > >>>>>>>>>>>> would ever happen again. The solution is to capture all the
> > >>>>>>>>> available
> > >>>>>>>>>>>> information about the reason and location of skips all the
> > >> time.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> That is a good point. I think we can either expose all
> levels
> > >>>>>>>>> metrics
> > >>>>>>>>>> as
> > >>>>>>>>>>> by
> > >>>>>>>>>>>> default, or only expose the most lower-level metrics and get
> > >> rid
> > >>>>> of
> > >>>>>>>>>> other
> > >>>>>>>>>>>> levels to let users do roll-ups themselves (which will be a
> > >> much
> > >>>>>>>>> larger
> > >>>>>>>>>>>> scope for discussion), or we can encourage users to not
> purely
> > >>>>>>>>> depend
> > >>>>>>>>>> on
> > >>>>>>>>>>>> metrics for such trouble shooting: that is to say, users
> only
> > >> be
> > >>>>>>>>>> alerted
> > >>>>>>>>>>>> based on metrics, and we can log an info / warn log4j entry
> > each
> > >>>>>>>>> time we
> > >>>>>>>>>>> are
> > >>>>>>>>>>>> about to skip a record all over the places, so that upon
> being
> > >>>>>>>>> notified
> > >>>>>>>>>>>> users can look into the logs to find the details on where /
> > >> when
> > >>>>> it
> > >>>>>>>>>>>> happens. WDYT?
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Guozhang
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> On Thu, Mar 29, 2018 at 3:57 PM, John Roesler <
> > >> john@confluent.io
> > >>>>
> > >>>>>>>>>> wrote:
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>> Hey Guozhang,
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Thanks for the review.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> 1.
> > >>>>>>>>>>>>> Matthias raised the same question about the "reason" tag
> > >> values.
> > >>>>> I
> > >>>>>>>>>> can
> > >>>>>>>>>>>> list
> > >>>>>>>>>>>>> all possible values of the "reason" tag, but I'm thinking
> > this
> > >>>>>>>>> level
> > >>>>>>>>>> of
> > >>>>>>>>>>>>> detail may not be KIP-worthy, maybe the code and
> > documentation
> > >>>>>>>>> review
> > >>>>>>>>>>>> would
> > >>>>>>>>>>>>> be sufficient. If you all disagree and would like it
> included
> > >> in
> > >>>>>>>>> the
> > >>>>>>>>>>>> KIP, I
> > >>>>>>>>>>>>> can certainly do that.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> If we do provide roll-up metrics, I agree with the pattern
> of
> > >>>>>>>>> keeping
> > >>>>>>>>>>> the
> > >>>>>>>>>>>>> same name but eliminating the tags for the dimensions that
> > >> were
> > >>>>>>>>>>>> rolled-up.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> 2.
> > >>>>>>>>>>>>> I'm not too sure that implementation efficiency really
> > >> becomes a
> > >>>>>>>>>> factor
> > >>>>>>>>>>>> in
> > >>>>>>>>>>>>> choosing whether to (by default) update one coarse metric
> at
> > >> the
> > >>>>>>>>>> thread
> > >>>>>>>>>>>>> level or one granular metric at the processor-node level,
> > >> since
> > >>>>>>>>> it's
> > >>>>>>>>>>> just
> > >>>>>>>>>>>>> one metric being updated either way. I do agree that if we
> > >> were
> > >>>>> to
> > >>>>>>>>>>> update
> > >>>>>>>>>>>>> the granular metrics and multiple roll-ups, then we should
> > >>>>>>>>> consider
> > >>>>>>>>>> the
> > >>>>>>>>>>>>> efficiency.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> I agree it's probably not necessary to surface the metrics
> > for
> > >>>>> all
> > >>>>>>>>>>> nodes
> > >>>>>>>>>>>>> regardless of whether they can or do skip records. Perhaps
> we
> > >>> can
> > >>>>>>>>>>> lazily
> > >>>>>>>>>>>>> register the metrics.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> In my email in response to Matthias, I gave an example of
> the
> > >>>>>>>>> kind of
> > >>>>>>>>>>>>> scenario that would lead me as an operator to run with
> DEBUG
> > >> on
> > >>>>>>>>> all
> > >>>>>>>>>> the
> > >>>>>>>>>>>>> time, since I wouldn't be sure, having seen a skipped
> record
> > >>>>> once,
> > >>>>>>>>>> that
> > >>>>>>>>>>>> it
> > >>>>>>>>>>>>> would ever happen again. The solution is to capture all the
> > >>>>>>>>> available
> > >>>>>>>>>>>>> information about the reason and location of skips all the
> > >> time.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> One thing you mention is the notion of setting alerts on
> > >> coarser
> > >>>>>>>>>>> metrics
> > >>>>>>>>>>>>> being easier than finer ones. All the metric alerting
> systems
> > >> I
> > >>>>>>>>> have
> > >>>>>>>>>>> used
> > >>>>>>>>>>>>> make it equally easy to alert on metrics by-tag or over
> tags.
> > >> So
> > >>>>>>>>> my
> > >>>>>>>>>>>>> experience doesn't say that this is a use case. Were you
> > >>> thinking
> > >>>>>>>>> of
> > >>>>>>>>>> an
> > >>>>>>>>>>>>> alerting system that makes such a pre-aggregation valuable?
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Thanks again,
> > >>>>>>>>>>>>> -John
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> On Thu, Mar 29, 2018 at 5:24 PM, Guozhang Wang <
> > >>>>>>>>> wangguoz@gmail.com>
> > >>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Hello John,
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Thanks for the KIP. Some comments:
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> 1. Could you list all the possible values of the "reason"
> > >> tag?
> > >>>>>>>>> In
> > >>>>>>>>>> the
> > >>>>>>>>>>>>> JIRA
> > >>>>>>>>>>>>>> ticket I left some potential reasons but I'm not clear if
> > >>> you're
> > >>>>>>>>>>> going
> > >>>>>>>>>>>> to
> > >>>>>>>>>>>>>> categorize each of them as a separate reason, or is there
> > any
> > >>>>>>>>>>>> additional
> > >>>>>>>>>>>>>> ones you have in mind.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Also I'm wondering if we should add another metric that do
> > >> not
> > >>>>>>>>> have
> > >>>>>>>>>>> the
> > >>>>>>>>>>>>>> reason tag but aggregates among all possible reasons? This
> > is
> > >>>>>>>>> for
> > >>>>>>>>>>> users
> > >>>>>>>>>>>>> to
> > >>>>>>>>>>>>>> easily set their alerting notifications (otherwise they
> have
> > >> to
> > >>>>>>>>>> write
> > >>>>>>>>>>>> one
> > >>>>>>>>>>>>>> notification rule per reason) in their monitoring systems.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> 2. Note that the processor-node metrics is actually
> > >>> "per-thread,
> > >>>>>>>>>>>>> per-task,
> > >>>>>>>>>>>>>> per-processor-node", and today we only set the per-thread
> > >>>>>>>>> metrics
> > >>>>>>>>>> as
> > >>>>>>>>>>>> INFO
> > >>>>>>>>>>>>>> while leaving the lower two layers as DEBUG. I agree with
> > >> your
> > >>>>>>>>>>> argument
> > >>>>>>>>>>>>>> that we are missing the per-client roll-up metrics today,
> > but
> > >>>>>>>>> I'm
> > >>>>>>>>>>>>> convinced
> > >>>>>>>>>>>>>> that the right way to approach it would be
> > >>>>>>>>>>> "just-providing-the-lowest-
> > >>>>>>>>>>>>>> level
> > >>>>>>>>>>>>>> metrics only".
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Note the recording implementations of these three levels are
> > >>>>>>>>>> different
> > >>>>>>>>>>>>>> internally today: we did not just do the rolling up to
> > >> generate
> > >>>>>>>>> the
> > >>>>>>>>>>>>>> higher-level metrics from the lower level ones, but we
> just
> > >>>>>>>>> record
> > >>>>>>>>>>> them
> > >>>>>>>>>>>>>> separately, which means that, if we turn on multiple
> levels
> > >> of
> > >>>>>>>>>>> metrics,
> > >>>>>>>>>>>>> we
> > >>>>>>>>>>>>>> maybe duplicate collecting some metrics. One can argue
> that
> > >> is
> > >>>>>>>>> not
> > >>>>>>>>>>> the
> > >>>>>>>>>>>>> best
> > >>>>>>>>>>>>>> way to represent multi-level metrics collecting and
> > >> reporting,
> > >>>>>>>>> but
> > >>>>>>>>>> by
> > >>>>>>>>>>>>> only
> > >>>>>>>>>>>>>> enabling thread-level metrics as INFO today, that
> > >>> implementation
> > >>>>>>>>>>> could
> > >>>>>>>>>>>> be
> > >>>>>>>>>>>>>> more efficient than only collecting the metrics at the
> > lowest
> > >>>>>>>>>> level,
> > >>>>>>>>>>>> and
> > >>>>>>>>>>>>>> then do the roll-up calculations outside of the metrics
> > >>> classes.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Plus, today not all processor-nodes may possibly skip
> > >> records,
> > >>>>>>>>>> AFAIK
> > >>>>>>>>>>> we
> > >>>>>>>>>>>>>> will only skip records at the source, sink, window and
> > >>>>>>>>> aggregation
> > >>>>>>>>>>>>>> processor nodes, so adding a metric per processor looks
> like
> > >> an
> > >>>>>>>>>>>> overkill
> > >>>>>>>>>>>>> to
> > >>>>>>>>>>>>>> me as well. On the other hand, from user's perspective the
> > >>>>>>>>> "reason"
> > >>>>>>>>>>> tag
> > >>>>>>>>>>>>> may
> > >>>>>>>>>>>>>> be sufficient for them to narrow down where inside the
> > >> topology
> > >>>>>>>>> is
> > >>>>>>>>>>>>> causing
> > >>>>>>>>>>>>>> records to be dropped on the floor. So I think the
> > >> "per-thread,
> > >>>>>>>>>>>> per-task"
> > >>>>>>>>>>>>>> level metrics should be sufficient for them in trouble
> shoot
> > >> in
> > >>>>>>>>>> DEBUG
> > >>>>>>>>>>>>> mode,
> > >>>>>>>>>>>>>> and we can add another "per-thread" level metrics as INFO
> > >> which
> > >>>>>>>>> is
> > >>>>>>>>>>>> turned
> > >>>>>>>>>>>>>> on by default. So under normal execution users still only
> > >> need
> > >>>>>>>>> INFO
> > >>>>>>>>>>>> level
> > >>>>>>>>>>>>>> metrics for alerting (e.g. set alerts on all
> skipped-records
> > >>>>>>>>>> metrics
> > >>>>>>>>>>> as
> > >>>>>>>>>>>>>> non-zero), and then upon trouble shooting they can turn on
> > >>> DEBUG
> > >>>>>>>>>>>> metrics
> > >>>>>>>>>>>>> to
> > >>>>>>>>>>>>>> look into which task is actually causing the skipped
> > records.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Guozhang
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> On Thu, Mar 29, 2018 at 2:03 PM, Matthias J. Sax <
> > >>>>>>>>>>>> matthias@confluent.io>
> > >>>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> Thanks for the KIP John.
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> Reading the material on the related Jiras, I am wondering
> > >> what
> > >>>>>>>>>>>> `reason`
> > >>>>>>>>>>>>>>> tags you want to introduce? Can you elaborate? The KIP
> > >> should
> > >>>>>>>>>> list
> > >>>>>>>>>>>>> those
> > >>>>>>>>>>>>>>> IMHO.
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> About the fine grained metrics vs the roll-up: you say
> that
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> the coarse metric aggregates across two dimensions
> > >>>>>>>>>> simultaneously
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> Can you elaborate why this is an issue? I am not
> convinced
> > >> atm
> > >>>>>>>>>> that
> > >>>>>>>>>>>> we
> > >>>>>>>>>>>>>>> should put the fine grained metrics into INFO level and
> > >> remove
> > >>>>>>>>>> the
> > >>>>>>>>>>>>>>> roll-up at thread level.
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> Given that they have to do this sum to get a usable
> > >>>>>>>>> top-level
> > >>>>>>>>>>> view
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> This is a fair concern, but I don't share the conclusion.
> > >>>>>>>>>> Offering
> > >>>>>>>>>>> a
> > >>>>>>>>>>>>>>> built-in `KafkaStreams` "client" roll-up out of the box
> > >> might
> > >>>>>>>>> be
> > >>>>>>>>>> a
> > >>>>>>>>>>>>>>> better solution. In the past we did not offer this due to
> > >>>>>>>>>>> performance
> > >>>>>>>>>>>>>>> concerns, but we could allow an "opt-in" mechanism. If
> you
> > >>>>>>>>>>> disagree,
> > >>>>>>>>>>>>> can
> > >>>>>>>>>>>>>>> you provide some reasoning and add them to the "Rejected
> > >>>>>>>>>>>> alternatives"
> > >>>>>>>>>>>>>>> section.
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> To rephrase: I understand the issue about missing
> top-level
> > >>>>>>>>> view,
> > >>>>>>>>>>> but
> > >>>>>>>>>>>>>>> instead of going more fine grained, we should consider to
> > >> add
> > >>>>>>>>>> this
> > >>>>>>>>>>>>>>> top-level view and add/keep the fine grained metrics at
> > >> DEBUG
> > >>>>>>>>>> level
> > >>>>>>>>>>>>> only
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> I am +1 to add TopologyTestDriver#metrics() and to remove
> > >> old
> > >>>>>>>>>>> metrics
> > >>>>>>>>>>>>>>> directly as you suggested.
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> -Matthias
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> On 3/28/18 6:42 PM, Ted Yu wrote:
> > >>>>>>>>>>>>>>>> Looks good to me.
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> On Wed, Mar 28, 2018 at 3:11 PM, John Roesler <
> > >>>>>>>>>> john@confluent.io
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> Hello all,
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> I am proposing KIP-274 to improve the metrics around
> > >>>>>>>>> skipped
> > >>>>>>>>>>>> records
> > >>>>>>>>>>>>>> in
> > >>>>>>>>>>>>>>>>> Streams.
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> Please find the details here:
> > >>>>>>>>>>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> > >>>>>>>>>>>>>>>>> 274%3A+Kafka+Streams+Skipped+Records+Metrics
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> Please let me know what you think!
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> Thanks,
> > >>>>>>>>>>>>>>>>> -John
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> --
> > >>>>>>>>>>>>>> -- Guozhang
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> --
> > >>>>>>>>>>>> -- Guozhang
> > >>>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>> --
> > >>>>>>>>>> -- Guozhang
> > >>>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>
> > >>>>>>
> > >>>>>>
> > >>>>>
> > >>>>
> > >>>>
> > >>>>
> > >>>
> > >>>
> > >>
> > >
> > >
> > >
> >
> >
>
>
> --
> -- Guozhang
>
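The WARN-level log entry proposed earlier in the thread would detail the topic, partition, offset, and reason of each skipped record. As a rough illustration, such an entry could be built like the sketch below; the class name, helper, and message format are assumptions for illustration only, not the wording the final PR settles on (in Streams the string would go through the processor's logger at WARN level rather than stdout).

```java
// Sketch of a skip log message carrying the context discussed in the thread:
// topic, partition, offset, and reason. Format is an illustrative assumption.
public class SkipLogSketch {

    static String skipMessage(String topic, int partition, long offset, String reason) {
        return String.format(
            "Skipping record due to %s. topic=[%s] partition=[%d] offset=[%d]",
            reason, topic, partition, offset);
    }

    public static void main(String[] args) {
        // In Streams this would be emitted via the node's logger at WARN level.
        System.out.println(
            skipMessage("clicks", 2, 1042L, "negative extracted timestamp"));
    }
}
```

An operator alerted by the single skipped-records metric could then grep the logs for these entries to find exactly where and why records were dropped.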

Re: [DISCUSS] KIP-274: Kafka Streams Skipped Records Metrics

Posted by Guozhang Wang <wa...@gmail.com>.
1. If we can indeed gather all the context information from the log4j
entries, I'd suggest we change to thread-level (I'm not sure if that is
doable, so if John already has a WIP PR, it can help us decide).

2. We can consider adding the API to TopologyTestDriver for general testing
purposes; that being said, I think Matthias has a good point that this
alone should not be a driving motivation for us to keep this metric as
task-level if 1) is true.
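To make point 2 concrete: a test using TopologyTestDriver#metrics() would essentially scan the returned metrics for skipped-records entries and sum them. The plain-Java sketch below simulates that roll-up over a simple name-to-value map; the metric names are hypothetical illustrations, not the names the KIP finally adopts.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of the roll-up a test could perform over driver metrics.
// Metric names below are hypothetical, for illustration only.
public class SkippedRecordsRollup {

    // Sum every metric whose name contains the given fragment.
    static double totalSkipped(Map<String, Double> metrics, String fragment) {
        return metrics.entrySet().stream()
                .filter(e -> e.getKey().contains(fragment))
                .mapToDouble(Map.Entry::getValue)
                .sum();
    }

    public static void main(String[] args) {
        // Stand-in for the map a test would get from driver.metrics():
        // task-level skip counts plus an unrelated metric.
        Map<String, Double> metrics = new LinkedHashMap<>();
        metrics.put("task.0_0.skipped-records-total", 3.0);
        metrics.put("task.0_1.skipped-records-total", 2.0);
        metrics.put("task.0_0.process-total", 100.0);

        System.out.println("total skipped = "
                + totalSkipped(metrics, "skipped-records")); // 5.0
    }
}
```

This also shows why task-level metrics are convenient in tests: the per-task entries are visible individually, and a one-line stream sums them when only the total matters.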



Guozhang


On Tue, Apr 3, 2018 at 11:36 AM, Matthias J. Sax <ma...@confluent.io>
wrote:

> Thanks Guozhang, that was my intent.
>
> @John: yes, we should not nail down the exact log message. It's just to
> point out the trade-off. If we can get the required information in the
> logs, we might not need task level metrics.
>
>
> -Matthias
>
> On 4/3/18 11:26 AM, Guozhang Wang wrote:
> > I think Matthias' comment is that, we can still record the metrics on the
> > thread-level, while having the WARN log entry to include sufficient
> context
> > information so that users can still easily narrow down the investigation
> > scope.
> >
> >
> > Guozhang
> >
> > On Tue, Apr 3, 2018 at 11:22 AM, John Roesler <jo...@confluent.io> wrote:
> >
> >> I agree we should add as much information as is reasonable to the log.
> For
> >> example, see this WIP PR I started for this KIP:
> >>
> >> https://github.com/apache/kafka/pull/4812/files#diff-
> >> 88d129f048bc842c7db5b2566a45fce8R80
> >>
> >> and
> >>
> >> https://github.com/apache/kafka/pull/4812/files#diff-
> >> 69e6789eb675ec978a1abd24fed96eb1R111
> >>
> >> I'm not sure if we should nail down the log messages in the KIP or in
> the
> >> PR discussion. What say you?
> >>
> >> Thanks,
> >> -John
> >>
> >> On Tue, Apr 3, 2018 at 12:20 AM, Matthias J. Sax <matthias@confluent.io
> >
> >> wrote:
> >>
> >>> Thanks for sharing your thoughts. As I mentioned originally, I am not
> >>> sure about the right log level either. Your arguments are convincing --
> >>> thus, I am fine with keeping WARN level.
> >>>
> >>> The task vs thread level argument is an interesting one. However, I am
> >>> wondering if we should add this information into the corresponding WARN
> >>> logs that we write anyway? For this case, we can also log the
> >>> corresponding operator (and other information like topic name etc if
> >>> needed). WDYT about this?
> >>>
> >>>
> >>> -Matthias
> >>>
> >>> On 4/2/18 8:31 PM, Guozhang Wang wrote:
> >>>> Regarding logging: I'm inclined to keep logging at WARN level since
> >>> skipped
> >>>> records are not expected in normal execution (for all reasons that we
> >> are
> >>>> aware of), and hence when error happens users should be alerted from
> >>>> metrics and looked into the log files, so to me if it is really
> >> spamming
> >>>> the log files it is also a good alert for users. Besides for
> >> deserialize
> >>>> errors we already log at WARN level for this reason.
> >>>>
> >>>> Regarding the metrics-levels: I was pondering on that as well. What
> >> made
> >>> me
> >>>> to think and agree on task-level rather than thread-level is that for some
> >>> reasons
> >>>> like window retention, they may possibly be happening on a subset of
> >>> input
> >>>> partitions, and since tasks are correlated with partitions, the task-level
> >>> metrics
> >>>> can help users to narrow down on the specific input data partitions.
> >>>>
> >>>>
> >>>> Guozhang
> >>>>
> >>>>
> >>>> On Mon, Apr 2, 2018 at 6:43 PM, John Roesler <jo...@confluent.io>
> >> wrote:
> >>>>
> >>>>> Hi Matthias,
> >>>>>
> >>>>> No worries! Thanks for the reply.
> >>>>>
> >>>>> 1) There isn't a connection. I tried using the TopologyTestDriver to
> >>> write
> >>>>> a quick test exercising the current behavior and discovered that the
> >>>>> metrics weren't available. It seemed like they should be, so I tacked
> >>> it on
> >>>>> to this KIP. If you feel it's inappropriate, I can pull it back out.
> >>>>>
> >>>>> 2) I was also concerned about that, but I figured it would come up in
> >>>>> discussion if I just went ahead and proposed it. And here we are!
> >>>>>
> >>>>> Here's my thought: maybe there are two classes of skips: "controlled"
> >>> and
> >>>>> "uncontrolled", where "controlled" means, as an app author, I
> >>> deliberately
> >>>>> filter out some events, and "uncontrolled" means that I simply don't
> >>>>> account for some feature of the data, and the framework skips them
> (as
> >>>>> opposed to crashing).
> >>>>>
> >>>>> In this breakdowns, the skips I'm adding metrics for are all
> >>> uncontrolled
> >>>>> skips (and we hope to measure all the uncontrolled skips). Our skips
> >> are
> >>>>> well documented, so it wouldn't be terrible to have an application in
> >>> which
> >>>>> you know you expect to have tons of uncontrolled skips, but it's not
> >>> great
> >>>>> either, since you may also have some *unexpected* uncontrolled skips.
> >>> It'll
> >>>>> be difficult to notice, since you're probably not alerting on the
> >> metric
> >>>>> and filtering out the logs (whatever their level).
> >>>>>
> >>>>> I'd recommend any app author, as an alternative, to convert all
> >> expected
> >>>>> skips to controlled ones, by updating the topology to filter those
> >>> records
> >>>>> out.
> >>>>>
> >>>>> Following from my recommendation, as a library author, I'm inclined
> to
> >>> mark
> >>>>> those logs WARN, since in my opinion, they should be concerning to
> the
> >>> app
> >>>>> authors. I'd definitely want to show, rather than hide, them by
> >>> default, so
> >>>>> I would pick INFO at least.
> >>>>>
> >>>>> That said, logging is always a tricky issue for lower-level libraries
> >>> that
> >>>>> run inside user code, since we don't have all the information we need
> >> to
> >>>>> make the right call.
> >>>>>
> >>>>>
> >>>>>
> >>>>> On your last note, yeah, I got that impression from Guozhang as well.
> >>>>> Thanks for the clarification.
> >>>>>
> >>>>> -John
> >>>>>
> >>>>>
> >>>>>
> >>>>> On Mon, Apr 2, 2018 at 4:03 PM, Matthias J. Sax <
> >> matthias@confluent.io>
> >>>>> wrote:
> >>>>>
> >>>>>> John,
> >>>>>>
> >>>>>> sorry for my late reply and thanks for updating the KIP.
> >>>>>>
> >>>>>> I like your approach about "metrics are for monitoring, logs are for
> >>>>>> debugging" -- however:
> >>>>>>
> >>>>>> 1) I don't see a connection between this and the task-level metrics
> >>> that
> >>>>>> you propose to get the metrics in `TopologyTestDriver`. I don't
> think
> >>>>>> people would monitor the `TopologyTestDriver` and thus am wondering why
> >> it
> >>>>>> is important to include the metrics there? Thread-level metric might
> >> be
> >>>>>> easier to monitor though (ie, less different metric to monitor).
> >>>>>>
> >>>>>> 2) I am a little worried about WARN level logging and that it might
> >> be
> >>>>>> too chatty -- as you pointed out, it's about debugging, thus DEBUG
> >>> level
> >>>>>> might be better. Not 100% sure about this to be honest. What is the
> >>>>>> general assumption about the frequency for skipped records? I could
> >>>>>> imagine cases for which skipped records are quite frequent and thus,
> >>>>>> WARN level logs might "flood" the logs
> >>>>>>
> >>>>>> One final remark:
> >>>>>>
> >>>>>>> More
> >>>>>>> generally, I would like to establish a pattern in which we could
> add
> >>>>> new
> >>>>>>> values for the "reason" tags without needing a KIP to do so.
> >>>>>>
> >>>>>> From my understanding, this is not feasible. Changing metrics is
> >> always
> >>>>>> considered a public API change, and we need a KIP for any change. As
> >> we
> >>>>>> moved away from tagging, it doesn't matter for the KIP anymore --
> >> just
> >>>>>> wanted to point it out.
> >>>>>>
> >>>>>>
> >>>>>> -Matthias
> >>>>>>
> >>>>>>
> >>>>>> On 3/30/18 2:47 PM, John Roesler wrote:
> >>>>>>> Allrighty! The KIP is updated.
> >>>>>>>
> >>>>>>> Thanks again, all, for the feedback.
> >>>>>>> -John
> >>>>>>>
> >>>>>>> On Fri, Mar 30, 2018 at 3:35 PM, John Roesler <jo...@confluent.io>
> >>>>> wrote:
> >>>>>>>
> >>>>>>>> Hey Guozhang and Bill,
> >>>>>>>>
> >>>>>>>> Ok, I'll update the KIP. At the risk of disturbing consensus, I'd
> >>> like
> >>>>>> to
> >>>>>>>> put it in the task instead of the thread so that it'll show up in
> >> the
> >>>>>>>> TopologyTestDriver metrics as well.
> >>>>>>>>
> >>>>>>>> I'm leaning toward keeping the scope where it is right now, but if
> >>>>>> others
> >>>>>>>> want to advocate for tossing in some more metrics, we can go that
> >>>>> route.
> >>>>>>>>
> >>>>>>>> Thanks all,
> >>>>>>>> -John
> >>>>>>>>
> >>>>>>>> On Fri, Mar 30, 2018 at 2:37 PM, Bill Bejeck <bb...@gmail.com>
> >>>>> wrote:
> >>>>>>>>
> >>>>>>>>> Thanks for the KIP John, and sorry for the late comments.
> >>>>>>>>>
> >>>>>>>>> I'm on the fence with providing a single level metrics, but I
> >> think
> >>>>>> we'll
> >>>>>>>>> have that discussion outside of this KIP.
> >>>>>>>>>
> >>>>>>>>>> * maintain one skipped-record metric (could be per-thread,
> >>> per-task,
> >>>>>> or
> >>>>>>>>>> per-processor-node) with no "reason"
> >>>>>>>>>> * introduce a warn-level log detailing the
> topic/partition/offset
> >>>>> and
> >>>>>>>>>> reason of the skipped record
> >>>>>>>>>
> >>>>>>>>> I'm +1 on both of these suggestions.
> >>>>>>>>>
> >>>>>>>>> Finally, we have had requests in the past for some metrics around
> >>>>> when
> >>>>>>>>> persistent store removes an expired window.  Would adding that to
> >>> our
> >>>>>>>>> metrics stretch the scope of this KIP too much?
> >>>>>>>>>
> >>>>>>>>> Thanks again and overall I'm +1 on this KIP
> >>>>>>>>>
> >>>>>>>>> Bill
> >>>>>>>>>
> >>>>>>>>> On Fri, Mar 30, 2018 at 2:00 PM, Guozhang Wang <
> >> wangguoz@gmail.com>
> >>>>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>>> The proposal sounds good to me. About "maintain only one level
> of
> >>>>>>>>> metrics"
> >>>>>>>>>> maybe we can discuss about that separately from this KIP since
> >> that
> >>>>>>>>> would
> >>>>>>>>>> be a larger scope of discussion. I agree that if we are going to
> >>>>>>>>> maintain
> >>>>>>>>>> only one-level metrics it should be lowest level and we would
> let
> >>>>>> users
> >>>>>>>>> to
> >>>>>>>>>> do the roll-ups themselves, but I'm still not fully convinced
> >> that
> >>>>> we
> >>>>>>>>>> should just provide single-level metrics, because 1) I think for
> >>>>>>>>> different
> >>>>>>>>>> metrics people may be interested to investigate into different
> >>>>>>>>>> granularities, e.g. for poll / commit rate these are at the
> >> lowest
> >>>>>>>>>> task-level metrics, while for process-rate / skip-rate they can
> >> be
> >>>>> as
> >>>>>>>>> low
> >>>>>>>>>> as processor-node metrics, and 2) user-side rolling ups may not
> >> be
> >>>>>> very
> >>>>>>>>>> straight-forward. But for 2) if someone can provide an efficient
> >>> and
> >>>>>>>>> easy
> >>>>>>>>>> implementation of that I can be persuaded :)
> >>>>>>>>>>
> >>>>>>>>>> For now I'm thinking we can add the metric on thread-level,
> >> either
> >>>>>> with
> >>>>>>>>>> finer grained ones with "reason" tag plus an aggregated one
> >> without
> >>>>>> the
> >>>>>>>>>> tag, or just having a single aggregated metric without the tag
> >>> looks
> >>>>>>>>> good
> >>>>>>>>>> to me.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Guozhang
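A rough sketch of the user-side roll-up Guozhang mentions: summing per-task skipped-record counts into one client-level number. The metric-name strings here are illustrative placeholders, not the actual Streams metric names, and the flat map stands in for whatever metrics snapshot a user would export.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SkippedRecordsRollUp {

    // Sum every per-task "skipped-records-total" value into a single
    // client-level total; all other metrics in the snapshot are ignored.
    static double clientLevelTotal(Map<String, Double> taskLevelMetrics) {
        return taskLevelMetrics.entrySet().stream()
            .filter(e -> e.getKey().endsWith("skipped-records-total"))
            .mapToDouble(Map.Entry::getValue)
            .sum();
    }

    public static void main(String[] args) {
        Map<String, Double> metrics = new LinkedHashMap<>();
        metrics.put("task.0_0.skipped-records-total", 3.0);
        metrics.put("task.0_1.skipped-records-total", 2.0);
        metrics.put("task.0_0.process-total", 100.0); // unrelated, ignored
        System.out.println(clientLevelTotal(metrics)); // prints 5.0
    }
}
```

The computation itself is a few lines; the non-trivial part for users is knowing which metric names to match, which is part of why the thread debates exposing a pre-aggregated metric.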
> >>>>>>>>>>
> >>>>>>>>>> On Fri, Mar 30, 2018 at 8:05 AM, John Roesler <
> john@confluent.io
> >>>
> >>>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Hey Guozhang,
> >>>>>>>>>>>
> >>>>>>>>>>> Thanks for the reply. Regarding JMX, I can dig it. I'll provide
> >> a
> >>>>>>>>> list in
> >>>>>>>>>>> the KIP. I was also thinking we'd better start a documentation
> >>> page
> >>>>>>>>> with
> >>>>>>>>>>> the metrics listed.
> >>>>>>>>>>>
> >>>>>>>>>>> I'd have no problem logging a warning when we skip records. On
> >> the
> >>>>>>>>> metric
> >>>>>>>>>>> front, really I'm just pushing for us to maintain only one
> level
> >>> of
> >>>>>>>>>>> metrics. If that's more or less granular (i.e., maybe we don't
> >>>>> have a
> >>>>>>>>>>> metric per reason and log the reason instead), that's fine by
> >> me.
> >>> I
> >>>>>>>>> just
> >>>>>>>>>>> don't think it provides a lot of extra value per complexity
> >>>>>> (interface
> >>>>>>>>>> and
> >>>>>>>>>>> implementation) to maintain roll-ups at the thread level in
> >>>>> addition
> >>>>>>>>> to
> >>>>>>>>>>> lower-level metrics.
> >>>>>>>>>>>
> >>>>>>>>>>> How about this instead:
> >>>>>>>>>>> * maintain one skipped-record metric (could be per-thread,
> >>>>> per-task,
> >>>>>>>>> or
> >>>>>>>>>>> per-processor-node) with no "reason"
> >>>>>>>>>>> * introduce a warn-level log detailing the
> >> topic/partition/offset
> >>>>> and
> >>>>>>>>>>> reason of the skipped record
> >>>>>>>>>>>
> >>>>>>>>>>> If you like that, I can update the KIP.
> >>>>>>>>>>>
> >>>>>>>>>>> Thanks,
> >>>>>>>>>>> -John
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> On Thu, Mar 29, 2018 at 6:22 PM, Guozhang Wang <
> >>> wangguoz@gmail.com
> >>>>>>
> >>>>>>>>>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>>> One thing you mention is the notion of setting alerts on
> >> coarser
> >>>>>>>>>>> metrics
> >>>>>>>>>>>> being easier than finer ones. All the metric alerting systems
> I
> >>>>> have
> >>>>>>>>>> used
> >>>>>>>>>>>> make it equally easy to alert on metrics by-tag or over tags.
> >> So
> >>>>> my
> >>>>>>>>>>>> experience doesn't say that this is a use case. Were you
> >> thinking
> >>>>>>>>> of an
> >>>>>>>>>>>> alerting system that makes such a pre-aggregation valuable?
> >>>>>>>>>>>>
> >>>>>>>>>>>> For the commonly used JMX reporter, tags will be encoded
> >>>>>>>>>>>> directly as part of the object name, and if users want to
> >>>>>>>>>>>> monitor them they need to know these values beforehand. That is
> >>>>>>>>>>>> also why I think we do want to list all the possible values of
> >>>>>>>>>>>> the reason tags in the KIP, since
> >>>>>>>>>>>>
> >>>>>>>>>>>>> In my email in response to Matthias, I gave an example of the
> >>>>>>>>> kind of
> >>>>>>>>>>>> scenario that would lead me as an operator to run with DEBUG
> on
> >>>>> all
> >>>>>>>>> the
> >>>>>>>>>>>> time, since I wouldn't be sure, having seen a skipped record
> >>> once,
> >>>>>>>>> that
> >>>>>>>>>>> it
> >>>>>>>>>>>> would ever happen again. The solution is to capture all the
> >>>>>>>>> available
> >>>>>>>>>>>> information about the reason and location of skips all the
> >> time.
> >>>>>>>>>>>>
> >>>>>>>>>>>> That is a good point. I think we can either expose all levels
> >>>>>>>>> metrics
> >>>>>>>>>> as
> >>>>>>>>>>> by
> >>>>>>>>>>>> default, or only expose the most lower-level metrics and get
> >> rid
> >>>>> of
> >>>>>>>>>> other
> >>>>>>>>>>>> levels to let users do roll-ups themselves (which will be a
> >> much
> >>>>>>>>> larger
> >>>>>>>>>>>> scope for discussion), or we can encourage users to not purely
> >>>>>>>>> depend
> >>>>>>>>>> on
> >>>>>>>>>>>> metrics for such troubleshooting: that is to say, users would
> >>>>>>>>>>>> only be alerted based on metrics, and we can log an info/warn
> >>>>>>>>>>>> log4j entry each time we are
> >>>>>>>>>>>> about to skip a record all over the places, so that upon being
> >>>>>>>>> notified
> >>>>>>>>>>>> users can look into the logs to find the details on where /
> >> when
> >>>>> it
> >>>>>>>>>>>> happens. WDYT?
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> Guozhang
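The JMX constraint described above — tags becoming part of the object name, so monitoring systems must enumerate every tag value in advance — can be seen with a small sketch using the standard `javax.management.ObjectName` class. The metric type and tag names below are hypothetical, not the final KIP names.

```java
import javax.management.ObjectName;

public class ReasonTagJmxDemo {
    public static void main(String[] args) throws Exception {
        // With a "reason" tag, each reason value yields a distinct JMX
        // object name, so an alerting rule must know every possible value:
        ObjectName tagged = new ObjectName(
            "kafka.streams:type=stream-task-metrics,task-id=0_1,reason=deserialization-error");

        // Without the tag, a single aggregated metric has one stable name:
        ObjectName aggregated = new ObjectName(
            "kafka.streams:type=stream-task-metrics,task-id=0_1");

        System.out.println(tagged.getKeyProperty("reason"));     // deserialization-error
        System.out.println(aggregated.getKeyProperty("reason")); // null
    }
}
```

This is why dropping the tag (and logging the reason instead) simplifies alert configuration: the object name no longer varies by reason.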
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Thu, Mar 29, 2018 at 3:57 PM, John Roesler <
> >> john@confluent.io
> >>>>
> >>>>>>>>>> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Hey Guozhang,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Thanks for the review.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> 1.
> >>>>>>>>>>>>> Matthias raised the same question about the "reason" tag
> >> values.
> >>>>> I
> >>>>>>>>>> can
> >>>>>>>>>>>> list
> >>>>>>>>>>>>> all possible values of the "reason" tag, but I'm thinking
> this
> >>>>>>>>> level
> >>>>>>>>>> of
> >>>>>>>>>>>>> detail may not be KIP-worthy, maybe the code and
> documentation
> >>>>>>>>> review
> >>>>>>>>>>>> would
> >>>>>>>>>>>>> be sufficient. If you all disagree and would like it included
> >> in
> >>>>>>>>> the
> >>>>>>>>>>>> KIP, I
> >>>>>>>>>>>>> can certainly do that.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> If we do provide roll-up metrics, I agree with the pattern of
> >>>>>>>>> keeping
> >>>>>>>>>>> the
> >>>>>>>>>>>>> same name but eliminating the tags for the dimensions that
> >> were
> >>>>>>>>>>>> rolled-up.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> 2.
> >>>>>>>>>>>>> I'm not too sure that implementation efficiency really
> >> becomes a
> >>>>>>>>>> factor
> >>>>>>>>>>>> in
> >>>>>>>>>>>>> choosing whether to (by default) update one coarse metric at
> >> the
> >>>>>>>>>> thread
> >>>>>>>>>>>>> level or one granular metric at the processor-node level,
> >> since
> >>>>>>>>> it's
> >>>>>>>>>>> just
> >>>>>>>>>>>>> one metric being updated either way. I do agree that if we
> >> were
> >>>>> to
> >>>>>>>>>>> update
> >>>>>>>>>>>>> the granular metrics and multiple roll-ups, then we should
> >>>>>>>>> consider
> >>>>>>>>>> the
> >>>>>>>>>>>>> efficiency.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I agree it's probably not necessary to surface the metrics
> for
> >>>>> all
> >>>>>>>>>>> nodes
> >>>>>>>>>>>>> regardless of whether they can or do skip records. Perhaps we
> >>> can
> >>>>>>>>>>> lazily
> >>>>>>>>>>>>> register the metrics.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> In my email in response to Matthias, I gave an example of the
> >>>>>>>>> kind of
> >>>>>>>>>>>>> scenario that would lead me as an operator to run with DEBUG
> >> on
> >>>>>>>>> all
> >>>>>>>>>> the
> >>>>>>>>>>>>> time, since I wouldn't be sure, having seen a skipped record
> >>>>> once,
> >>>>>>>>>> that
> >>>>>>>>>>>> it
> >>>>>>>>>>>>> would ever happen again. The solution is to capture all the
> >>>>>>>>> available
> >>>>>>>>>>>>> information about the reason and location of skips all the
> >> time.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> One thing you mention is the notion of setting alerts on
> >> coarser
> >>>>>>>>>>> metrics
> >>>>>>>>>>>>> being easier than finer ones. All the metric alerting systems
> >> I
> >>>>>>>>> have
> >>>>>>>>>>> used
> >>>>>>>>>>>>> make it equally easy to alert on metrics by-tag or over tags.
> >> So
> >>>>>>>>> my
> >>>>>>>>>>>>> experience doesn't say that this is a use case. Were you
> >>> thinking
> >>>>>>>>> of
> >>>>>>>>>> an
> >>>>>>>>>>>>> alerting system that makes such a pre-aggregation valuable?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Thanks again,
> >>>>>>>>>>>>> -John
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Thu, Mar 29, 2018 at 5:24 PM, Guozhang Wang <
> >>>>>>>>> wangguoz@gmail.com>
> >>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> Hello John,
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Thanks for the KIP. Some comments:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> 1. Could you list all the possible values of the "reason"
> >> tag?
> >>>>>>>>> In
> >>>>>>>>>> the
> >>>>>>>>>>>>> JIRA
> >>>>>>>>>>>>>> ticket I left some potential reasons but I'm not clear if
> >>> you're
> >>>>>>>>>>> going
> >>>>>>>>>>>> to
> >>>>>>>>>>>>>> categorize each of them as a separate reason, or are there
> >>>>>>>>>>>>>> any additional ones you have in mind.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Also I'm wondering if we should add another metric that does
> >>>>>>>>>>>>>> not have the reason tag but aggregates among all possible
> >>>>>>>>>>>>>> reasons. This is for users to easily set their alerting
> >>>>>>>>>>>>>> notifications (otherwise they have to write one notification
> >>>>>>>>>>>>>> rule per reason) in their monitoring systems.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> 2. Note that the processor-node metrics is actually
> >>> "per-thread,
> >>>>>>>>>>>>> per-task,
> >>>>>>>>>>>>>> per-processor-node", and today we only set the per-thread
> >>>>>>>>> metrics
> >>>>>>>>>> as
> >>>>>>>>>>>> INFO
> >>>>>>>>>>>>>> while leaving the lower two layers as DEBUG. I agree with
> >>>>>>>>>>>>>> your argument that we are missing the per-client roll-up
> >>>>>>>>>>>>>> metrics today, but I'm not convinced that the right way to
> >>>>>>>>>>>>>> approach it would be "just-providing-the-lowest-level metrics
> >>>>>>>>>>>>>> only".
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Note the recording implementations of these three levels are
> >>>>>>>>>>>>>> different internally today: we did not just do the rolling up
> >>>>>>>>>>>>>> to generate the higher-level metrics from the lower-level
> >>>>>>>>>>>>>> ones, but we just record them
> >>>>>>>>>>>>>> separately, which means that, if we turn on multiple levels
> >> of
> >>>>>>>>>>> metrics,
> >>>>>>>>>>>>> we
> >>>>>>>>>>>>>> maybe duplicate collecting some metrics. One can argue that
> >> is
> >>>>>>>>> not
> >>>>>>>>>>> the
> >>>>>>>>>>>>> best
> >>>>>>>>>>>>>> way to represent multi-level metrics collecting and
> >> reporting,
> >>>>>>>>> but
> >>>>>>>>>> by
> >>>>>>>>>>>>> only
> >>>>>>>>>>>>>> enabling thread-level metrics as INFO today, that
> >>> implementation
> >>>>>>>>>>> could
> >>>>>>>>>>>> be
> >>>>>>>>>>>>>> more efficient than only collecting the metrics at the
> lowest
> >>>>>>>>>> level,
> >>>>>>>>>>>> and
> >>>>>>>>>>>>>> then do the roll-up calculations outside of the metrics
> >>> classes.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Plus, today not all processor-nodes may possibly skip
> >> records,
> >>>>>>>>>> AFAIK
> >>>>>>>>>>> we
> >>>>>>>>>>>>>> will only skip records at the source, sink, window and
> >>>>>>>>> aggregation
> >>>>>>>>>>>>>> processor nodes, so adding a metric per processor looks like
> >> an
> >>>>>>>>>>>> overkill
> >>>>>>>>>>>>> to
> >>>>>>>>>>>>>> me as well. On the other hand, from user's perspective the
> >>>>>>>>> "reason"
> >>>>>>>>>>> tag
> >>>>>>>>>>>>> may
> >>>>>>>>>>>>>> be sufficient for them to narrow down where inside the
> >> topology
> >>>>>>>>> is
> >>>>>>>>>>>>> causing
> >>>>>>>>>>>>>> records to be dropped on the floor. So I think the
> >>>>>>>>>>>>>> "per-thread, per-task" level metrics should be sufficient for
> >>>>>>>>>>>>>> them to troubleshoot in DEBUG mode,
> >>>>>>>>>>>>>> and we can add another "per-thread" level metrics as INFO
> >> which
> >>>>>>>>> is
> >>>>>>>>>>>> turned
> >>>>>>>>>>>>>> on by default. So under normal execution users still only
> >> need
> >>>>>>>>> INFO
> >>>>>>>>>>>> level
> >>>>>>>>>>>>>> metrics for alerting (e.g. set alerts on all skipped-records
> >>>>>>>>>> metrics
> >>>>>>>>>>> as
> >>>>>>>>>>>>>> non-zero), and then upon trouble shooting they can turn on
> >>> DEBUG
> >>>>>>>>>>>> metrics
> >>>>>>>>>>>>> to
> >>>>>>>>>>>>>> look into which task is actually causing the skipped
> records.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Guozhang
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Thu, Mar 29, 2018 at 2:03 PM, Matthias J. Sax <
> >>>>>>>>>>>> matthias@confluent.io>
> >>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Thanks for the KIP John.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Reading the material on the related Jiras, I am wondering
> >> what
> >>>>>>>>>>>> `reason`
> >>>>>>>>>>>>>>> tags you want to introduce? Can you elaborate? The KIP
> >> should
> >>>>>>>>>> list
> >>>>>>>>>>>>> those
> >>>>>>>>>>>>>>> IMHO.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> About the fine grained metrics vs the roll-up: you say that
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> the coarse metric aggregates across two dimensions
> >>>>>>>>>> simultaneously
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Can you elaborate why this is an issue? I am not convinced
> >> atm
> >>>>>>>>>> that
> >>>>>>>>>>>> we
> >>>>>>>>>>>>>>> should put the fine grained metrics into INFO level and
> >> remove
> >>>>>>>>>> the
> >>>>>>>>>>>>>>> roll-up at thread level.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Given that they have to do this sum to get a usable
> >>>>>>>>> top-level
> >>>>>>>>>>> view
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> This is a fair concern, but I don't share the conclusion.
> >>>>>>>>>> Offering
> >>>>>>>>>>> a
> >>>>>>>>>>>>>>> built-in `KafkaStreams` "client" roll-up out of the box
> >> might
> >>>>>>>>> be
> >>>>>>>>>> a
> >>>>>>>>>>>>>>> better solution. In the past we did not offer this due to
> >>>>>>>>>>> performance
> >>>>>>>>>>>>>>> concerns, but we could allow an "opt-in" mechanism. If you
> >>>>>>>>>>> disagree,
> >>>>>>>>>>>>> can
> >>>>>>>>>>>>>>> you provide some reasoning and add them to the "Rejected
> >>>>>>>>>>>> alternatives"
> >>>>>>>>>>>>>>> section.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> To rephrase: I understand the issue about missing top-level
> >>>>>>>>> view,
> >>>>>>>>>>> but
> >>>>>>>>>>>>>>> instead of going more fine grained, we should consider to
> >> add
> >>>>>>>>>> this
> >>>>>>>>>>>>>>> top-level view and add/keep the fine grained metrics at
> >> DEBUG
> >>>>>>>>>> level
> >>>>>>>>>>>>> only
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> I am +1 to add TopologyTestDriver#metrics() and to remove
> >> old
> >>>>>>>>>>> metrics
> >>>>>>>>>>>>>>> directly as you suggested.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> -Matthias
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On 3/28/18 6:42 PM, Ted Yu wrote:
> >>>>>>>>>>>>>>>> Looks good to me.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> On Wed, Mar 28, 2018 at 3:11 PM, John Roesler <
> >>>>>>>>>> john@confluent.io
> >>>>>>>>>>>>
> >>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Hello all,
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> I am proposing KIP-274 to improve the metrics around
> >>>>>>>>> skipped
> >>>>>>>>>>>> records
> >>>>>>>>>>>>>> in
> >>>>>>>>>>>>>>>>> Streams.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Please find the details here:
> >>>>>>>>>>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> >>>>>>>>>>>>>>>>> 274%3A+Kafka+Streams+Skipped+Records+Metrics
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Please let me know what you think!
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>>>> -John
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> --
> >>>>>>>>>>>>>> -- Guozhang
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> --
> >>>>>>>>>>>> -- Guozhang
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> --
> >>>>>>>>>> -- Guozhang
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>>
> >>>>
> >>>
> >>>
> >>
> >
> >
> >
>
>


-- 
-- Guozhang

Re: [DISCUSS] KIP-274: Kafka Streams Skipped Records Metrics

Posted by "Matthias J. Sax" <ma...@confluent.io>.
Thanks Guozhang, that was my intent.

@John: yes, we should not nail down the exact log message. It's just to
point out the trade-off. If we can get the required information in the
logs, we might not need task-level metrics.


-Matthias
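A minimal sketch of the kind of context such a WARN message could carry — task, processor node, topic/partition/offset, and reason — so that a thread-level metric plus the log line is enough to localize a skip. The field names and format here are illustrative, not taken from the KIP or the eventual implementation.

```java
public class SkipWarnMessageDemo {

    // Build a log message carrying enough context that an operator alerted
    // by an aggregate skipped-records metric can find where skips happened.
    static String skipWarning(String taskId, String node, String topic,
                              int partition, long offset, String reason) {
        return String.format(
            "Skipping record. reason=[%s] processor=[%s] task=[%s] "
                + "topic=[%s] partition=[%d] offset=[%d]",
            reason, node, taskId, topic, partition, offset);
    }

    public static void main(String[] args) {
        // In the real code this string would go to a log4j/slf4j WARN call.
        System.out.println(skipWarning("0_1", "KSTREAM-SOURCE-0000000000",
            "input-topic", 2, 42L, "negative timestamp"));
    }
}
```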

On 4/3/18 11:26 AM, Guozhang Wang wrote:
> I think Matthias' comment is that we can still record the metrics at the
> thread level, while having the WARN log entry include sufficient context
> information so that users can still easily narrow down the investigation
> scope.
> 
> 
> Guozhang
> 
> On Tue, Apr 3, 2018 at 11:22 AM, John Roesler <jo...@confluent.io> wrote:
> 
>> I agree we should add as much information as is reasonable to the log. For
>> example, see this WIP PR I started for this KIP:
>>
>> https://github.com/apache/kafka/pull/4812/files#diff-
>> 88d129f048bc842c7db5b2566a45fce8R80
>>
>> and
>>
>> https://github.com/apache/kafka/pull/4812/files#diff-
>> 69e6789eb675ec978a1abd24fed96eb1R111
>>
>> I'm not sure if we should nail down the log messages in the KIP or in the
>> PR discussion. What say you?
>>
>> Thanks,
>> -John
>>
>> On Tue, Apr 3, 2018 at 12:20 AM, Matthias J. Sax <ma...@confluent.io>
>> wrote:
>>
>>> Thanks for sharing your thoughts. As I mentioned originally, I am not
>>> sure about the right log level either. Your arguments are convincing --
>>> thus, I am fine with keeping WARN level.
>>>
>>> The task vs thread level argument is an interesting one. However, I am
>>> wondering if we should add this information into the corresponding WARN
>>> logs that we write anyway? For this case, we can also log the
>>> corresponding operator (and other information like topic name etc if
>>> needed). WDYT about this?
>>>
>>>
>>> -Matthias
>>>
>>> On 4/2/18 8:31 PM, Guozhang Wang wrote:
> >>>> Regarding logging: I'm inclined to keep logging at WARN level, since
> >>>> skipped records are not expected in normal execution (for all reasons
> >>>> that we are aware of), and hence when an error happens users should be
> >>>> alerted by metrics and look into the log files; so to me, if it is
> >>>> really spamming the log files, that is also a good alert for users.
> >>>> Besides, for deserialization errors we already log at WARN level for
> >>>> this reason.
>>>>
> >>>> Regarding the metrics levels: I was pondering on that as well. What
> >>>> made me think about and agree on task-level rather than thread-level
> >>>> is that some reasons, like window retention, may be happening on only
> >>>> a subset of input partitions, and since tasks are correlated with
> >>>> partitions, the task-level metrics can help users narrow down the
> >>>> specific input data partitions.
>>>>
>>>>
>>>> Guozhang
>>>>
>>>>
>>>> On Mon, Apr 2, 2018 at 6:43 PM, John Roesler <jo...@confluent.io>
>> wrote:
>>>>
>>>>> Hi Matthias,
>>>>>
>>>>> No worries! Thanks for the reply.
>>>>>
>>>>> 1) There isn't a connection. I tried using the TopologyTestDriver to
>>> write
>>>>> a quick test exercising the current behavior and discovered that the
>>>>> metrics weren't available. It seemed like they should be, so I tacked
>>> it on
>>>>> to this KIP. If you feel it's inappropriate, I can pull it back out.
>>>>>
>>>>> 2) I was also concerned about that, but I figured it would come up in
>>>>> discussion if I just went ahead and proposed it. And here we are!
>>>>>
>>>>> Here's my thought: maybe there are two classes of skips: "controlled"
>>> and
>>>>> "uncontrolled", where "controlled" means, as an app author, I
>>> deliberately
>>>>> filter out some events, and "uncontrolled" means that I simply don't
>>>>> account for some feature of the data, and the framework skips them (as
>>>>> opposed to crashing).
>>>>>
>>>>> In this breakdowns, the skips I'm adding metrics for are all
>>> uncontrolled
>>>>> skips (and we hope to measure all the uncontrolled skips). Our skips
>> are
>>>>> well documented, so it wouldn't be terrible to have an application in
>>> which
>>>>> you know you expect to have tons of uncontrolled skips, but it's not
>>> great
>>>>> either, since you may also have some *unexpected* uncontrolled skips.
>>> It'll
>>>>> be difficult to notice, since you're probably not alerting on the
>> metric
>>>>> and filtering out the logs (whatever their level).
>>>>>
>>>>> I'd recommend any app author, as an alternative, to convert all
>> expected
>>>>> skips to controlled ones, by updating the topology to filter those
>>> records
>>>>> out.
>>>>>
>>>>> Following from my recommendation, as a library author, I'm inclined to
>>> mark
>>>>> those logs WARN, since in my opinion, they should be concerning to the
>>> app
>>>>> authors. I'd definitely want to show, rather than hide, them by
>>> default, so
>>>>> I would pick INFO at least.
>>>>>
>>>>> That said, logging is always a tricky issue for lower-level libraries
>>> that
>>>>> run inside user code, since we don't have all the information we need
>> to
>>>>> make the right call.
>>>>>
>>>>>
>>>>>
>>>>> On your last note, yeah, I got that impression from Guozhang as well.
>>>>> Thanks for the clarification.
>>>>>
>>>>> -John
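The "controlled skip" recommendation above — explicitly filtering out records the application knows it cannot process, rather than letting the framework drop them — might look roughly like the following. Plain Java stands in for a topology filter here; the names and the skip counter are illustrative, not part of the Streams API.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Predicate;
import java.util.stream.Collectors;

public class ControlledSkipDemo {
    static final AtomicLong controlledSkips = new AtomicLong();

    // Deliberately drop records the app expects to be unprocessable,
    // counting them itself, so that any remaining framework-level skip is
    // genuinely unexpected and worth alerting on.
    static List<String> dropExpectedBadRecords(List<String> records,
                                               Predicate<String> isProcessable) {
        return records.stream()
            .peek(r -> { if (!isProcessable.test(r)) controlledSkips.incrementAndGet(); })
            .filter(isProcessable)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> kept = dropExpectedBadRecords(
            List.of("good-1", "", "good-2"), r -> !r.isEmpty());
        System.out.println(kept);                  // [good-1, good-2]
        System.out.println(controlledSkips.get()); // 1
    }
}
```

In an actual topology the same effect would come from a filter step ahead of the processing logic, leaving the framework's skipped-records metric to report only uncontrolled skips.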
>>>>>
>>>>>
>>>>>
>>>>> On Mon, Apr 2, 2018 at 4:03 PM, Matthias J. Sax <
>> matthias@confluent.io>
>>>>> wrote:
>>>>>
>>>>>> John,
>>>>>>
>>>>>> sorry for my late reply and thanks for updating the KIP.
>>>>>>
>>>>>> I like your approach about "metrics are for monitoring, logs are for
>>>>>> debugging" -- however:
>>>>>>
>>>>>> 1) I don't see a connection between this and the task-level metrics
>>>>>> that you propose to get the metrics in `TopologyTestDriver`. I don't
>>>>>> think people would monitor the `TopologyTestDriver`, and thus I wonder
>>>>>> why it is important to include the metrics there. Thread-level metrics
>>>>>> might be easier to monitor though (i.e., fewer different metrics to
>>>>>> monitor).
>>>>>>
>>>>>> 2) I am a little worried about WARN-level logging and that it might be
>>>>>> too chatty -- as you pointed out, it's about debugging, thus DEBUG
>>>>>> level might be better. Not 100% sure about this to be honest. What is
>>>>>> the general assumption about the frequency of skipped records? I could
>>>>>> imagine cases for which skipped records are quite frequent and thus
>>>>>> WARN-level logs might "flood" the logs.
>>>>>>
>>>>>> One final remark:
>>>>>>
>>>>>>> More
>>>>>>> generally, I would like to establish a pattern in which we could add
>>>>> new
>>>>>>> values for the "reason" tags without needing a KIP to do so.
>>>>>>
>>>>>> From my understanding, this is not feasible. Changing metrics is
>>>>>> always considered a public API change, and we need a KIP for any
>>>>>> change. As we moved away from tagging, it doesn't matter for the KIP
>>>>>> anymore -- just wanted to point it out.
>>>>>>
>>>>>>
>>>>>> -Matthias
>>>>>>
>>>>>>
>>>>>> On 3/30/18 2:47 PM, John Roesler wrote:
>>>>>>> Allrighty! The KIP is updated.
>>>>>>>
>>>>>>> Thanks again, all, for the feedback.
>>>>>>> -John
>>>>>>>
>>>>>>> On Fri, Mar 30, 2018 at 3:35 PM, John Roesler <jo...@confluent.io>
>>>>> wrote:
>>>>>>>
>>>>>>>> Hey Guozhang and Bill,
>>>>>>>>
>>>>>>>> Ok, I'll update the KIP. At the risk of disturbing consensus, I'd
>>> like
>>>>>> to
>>>>>>>> put it in the task instead of the thread so that it'll show up in
>> the
>>>>>>>> TopologyTestDriver metrics as well.
>>>>>>>>
>>>>>>>> I'm leaning toward keeping the scope where it is right now, but if
>>>>>> others
>>>>>>>> want to advocate for tossing in some more metrics, we can go that
>>>>> route.
>>>>>>>>
>>>>>>>> Thanks all,
>>>>>>>> -John
>>>>>>>>
>>>>>>>> On Fri, Mar 30, 2018 at 2:37 PM, Bill Bejeck <bb...@gmail.com>
>>>>> wrote:
>>>>>>>>
>>>>>>>>> Thanks for the KIP John, and sorry for the late comments.
>>>>>>>>>
>>>>>>>>> I'm on the fence with providing a single level metrics, but I
>> think
>>>>>> we'll
>>>>>>>>> have that discussion outside of this KIP.
>>>>>>>>>
>>>>>>>>>> * maintain one skipped-record metric (could be per-thread,
>>> per-task,
>>>>>> or
>>>>>>>>>> per-processor-node) with no "reason"
>>>>>>>>>> * introduce a warn-level log detailing the topic/partition/offset
>>>>> and
>>>>>>>>>> reason of the skipped record
>>>>>>>>>
>>>>>>>>> I'm +1 on both of these suggestions.
>>>>>>>>>
>>>>>>>>> Finally, we have had requests in the past for some metrics around
>>>>>>>>> when a persistent store removes an expired window. Would adding
>>>>>>>>> that to our metrics stretch the scope of this KIP too much?
>>>>>>>>>
>>>>>>>>> Thanks again and overall I'm +1 on this KIP
>>>>>>>>>
>>>>>>>>> Bill
>>>>>>>>>
>>>>>>>>> On Fri, Mar 30, 2018 at 2:00 PM, Guozhang Wang <
>> wangguoz@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> The proposal sounds good to me. About "maintain only one level of
>>>>>>>>>> metrics", maybe we can discuss that separately from this KIP,
>>>>>>>>>> since that would be a larger scope of discussion. I agree that if
>>>>>>>>>> we are going to maintain only one-level metrics it should be the
>>>>>>>>>> lowest level and we would let users
>>>>>>>>>> do the roll-ups themselves, but I'm still not fully convinced
>> that
>>>>> we
>>>>>>>>>> should just provide single-level metrics, because 1) I think for
>>>>>>>>> different
>>>>>>>>>> metrics people may be interested to investigate into different
>>>>>>>>>> granularities, e.g. for poll / commit rate these are at the
>> lowest
>>>>>>>>>> task-level metrics, while for process-rate / skip-rate they can
>> be
>>>>> as
>>>>>>>>> low
>>>>>>>>>> as processor-node metrics, and 2) user-side roll-ups may not be
>>>>>>>>>> very straightforward. But for 2) if someone can provide an
>>>>>>>>>> efficient and easy implementation of that I can be persuaded :)
>>>>>>>>>>
>>>>>>>>>> For now I'm thinking we can add the metric on thread-level,
>> either
>>>>>> with
>>>>>>>>>> finer grained ones with "reason" tag plus an aggregated one
>> without
>>>>>> the
>>>>>>>>>> tag, or just having a single aggregated metric without the tag
>>> looks
>>>>>>>>> good
>>>>>>>>>> to me.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Guozhang
>>>>>>>>>>
>>>>>>>>>> On Fri, Mar 30, 2018 at 8:05 AM, John Roesler <john@confluent.io
>>>
>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hey Guozhang,
>>>>>>>>>>>
>>>>>>>>>>> Thanks for the reply. Regarding JMX, I can dig it. I'll provide
>> a
>>>>>>>>> list in
>>>>>>>>>>> the KIP. I was also thinking we'd better start a documentation
>>> page
>>>>>>>>> with
>>>>>>>>>>> the metrics listed.
>>>>>>>>>>>
>>>>>>>>>>> I'd have no problem logging a warning when we skip records. On
>> the
>>>>>>>>> metric
>>>>>>>>>>> front, really I'm just pushing for us to maintain only one level
>>> of
>>>>>>>>>>> metrics. If that's more or less granular (i.e., maybe we don't
>>>>> have a
>>>>>>>>>>> metric per reason and log the reason instead), that's fine by
>> me.
>>> I
>>>>>>>>> just
>>>>>>>>>>> don't think it provides a lot of extra value per complexity
>>>>>> (interface
>>>>>>>>>> and
>>>>>>>>>>> implementation) to maintain roll-ups at the thread level in
>>>>> addition
>>>>>>>>> to
>>>>>>>>>>> lower-level metrics.
>>>>>>>>>>>
>>>>>>>>>>> How about this instead:
>>>>>>>>>>> * maintain one skipped-record metric (could be per-thread,
>>>>> per-task,
>>>>>>>>> or
>>>>>>>>>>> per-processor-node) with no "reason"
>>>>>>>>>>> * introduce a warn-level log detailing the
>> topic/partition/offset
>>>>> and
>>>>>>>>>>> reason of the skipped record
>>>>>>>>>>>
>>>>>>>>>>> If you like that, I can update the KIP.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> -John
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Mar 29, 2018 at 6:22 PM, Guozhang Wang <
>>> wangguoz@gmail.com
>>>>>>
>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>>> One thing you mention is the notion of setting alerts on
>> coarser
>>>>>>>>>>> metrics
>>>>>>>>>>>> being easier than finer ones. All the metric alerting systems I
>>>>> have
>>>>>>>>>> used
>>>>>>>>>>>> make it equally easy to alert on metrics by-tag or over tags.
>> So
>>>>> my
>>>>>>>>>>>> experience doesn't say that this is a use case. Were you
>> thinking
>>>>>>>>> of an
>>>>>>>>>>>> alerting system that makes such a pre-aggregation valuable?
>>>>>>>>>>>>
>>>>>>>>>>>> For the commonly used JMX reporter, tags will be encoded
>>>>>>>>>>>> directly as part of the object name, and if users want to
>>>>>>>>>>>> monitor them they need to know these values beforehand. That is
>>>>>>>>>>>> also why I think we do want to list all the possible values of
>>>>>>>>>>>> the reason tags in the KIP, since
>>>>>>>>>>>>
>>>>>>>>>>>>> In my email in response to Matthias, I gave an example of the
>>>>>>>>> kind of
>>>>>>>>>>>> scenario that would lead me as an operator to run with DEBUG on
>>>>> all
>>>>>>>>> the
>>>>>>>>>>>> time, since I wouldn't be sure, having seen a skipped record
>>> once,
>>>>>>>>> that
>>>>>>>>>>> it
>>>>>>>>>>>> would ever happen again. The solution is to capture all the
>>>>>>>>> available
>>>>>>>>>>>> information about the reason and location of skips all the
>> time.
>>>>>>>>>>>>
>>>>>>>>>>>> That is a good point. I think we can either expose all levels
>>>>>>>>> metrics
>>>>>>>>>> as
>>>>>>>>>>> by
>>>>>>>>>>>> default, or only expose the most lower-level metrics and get
>> rid
>>>>> of
>>>>>>>>>> other
>>>>>>>>>>>> levels to let users do roll-ups themselves (which will be a
>> much
>>>>>>>>> larger
>>>>>>>>>>>> scope for discussion), or we can encourage users to not purely
>>>>>>>>> depend
>>>>>>>>>> on
>>>>>>>>>>>> metrics for such troubleshooting: that is to say, users would
>>>>>>>>>>>> only be alerted based on metrics, and we can log an info/warn
>>>>>>>>>>>> log4j entry each time we are
>>>>>>>>>>>> about to skip a record all over the places, so that upon being
>>>>>>>>> notified
>>>>>>>>>>>> users can look into the logs to find the details on where /
>> when
>>>>> it
>>>>>>>>>>>> happens. WDYT?
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Guozhang
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Mar 29, 2018 at 3:57 PM, John Roesler <
>> john@confluent.io
>>>>
>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hey Guozhang,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks for the review.
>>>>>>>>>>>>>
>>>>>>>>>>>>> 1.
>>>>>>>>>>>>> Matthias raised the same question about the "reason" tag
>> values.
>>>>> I
>>>>>>>>>> can
>>>>>>>>>>>> list
>>>>>>>>>>>>> all possible values of the "reason" tag, but I'm thinking this
>>>>>>>>> level
>>>>>>>>>> of
>>>>>>>>>>>>> detail may not be KIP-worthy, maybe the code and documentation
>>>>>>>>> review
>>>>>>>>>>>> would
>>>>>>>>>>>>> be sufficient. If you all disagree and would like it included
>> in
>>>>>>>>> the
>>>>>>>>>>>> KIP, I
>>>>>>>>>>>>> can certainly do that.
>>>>>>>>>>>>>
>>>>>>>>>>>>> If we do provide roll-up metrics, I agree with the pattern of
>>>>>>>>> keeping
>>>>>>>>>>> the
>>>>>>>>>>>>> same name but eliminating the tags for the dimensions that
>> were
>>>>>>>>>>>> rolled-up.
>>>>>>>>>>>>>
>>>>>>>>>>>>> 2.
>>>>>>>>>>>>> I'm not too sure that implementation efficiency really
>> becomes a
>>>>>>>>>> factor
>>>>>>>>>>>> in
>>>>>>>>>>>>> choosing whether to (by default) update one coarse metric at
>> the
>>>>>>>>>> thread
>>>>>>>>>>>>> level or one granular metric at the processor-node level,
>> since
>>>>>>>>> it's
>>>>>>>>>>> just
>>>>>>>>>>>>> one metric being updated either way. I do agree that if we
>> were
>>>>> to
>>>>>>>>>>> update
>>>>>>>>>>>>> the granular metrics and multiple roll-ups, then we should
>>>>>>>>> consider
>>>>>>>>>> the
>>>>>>>>>>>>> efficiency.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I agree it's probably not necessary to surface the metrics for
>>>>> all
>>>>>>>>>>> nodes
>>>>>>>>>>>>> regardless of whether they can or do skip records. Perhaps we
>>> can
>>>>>>>>>>> lazily
>>>>>>>>>>>>> register the metrics.
>>>>>>>>>>>>>
>>>>>>>>>>>>> In my email in response to Matthias, I gave an example of the
>>>>>>>>> kind of
>>>>>>>>>>>>> scenario that would lead me as an operator to run with DEBUG
>> on
>>>>>>>>> all
>>>>>>>>>> the
>>>>>>>>>>>>> time, since I wouldn't be sure, having seen a skipped record
>>>>> once,
>>>>>>>>>> that
>>>>>>>>>>>> it
>>>>>>>>>>>>> would ever happen again. The solution is to capture all the
>>>>>>>>> available
>>>>>>>>>>>>> information about the reason and location of skips all the
>> time.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> One thing you mention is the notion of setting alerts on
>> coarser
>>>>>>>>>>> metrics
>>>>>>>>>>>>> being easier than finer ones. All the metric alerting systems
>> I
>>>>>>>>> have
>>>>>>>>>>> used
>>>>>>>>>>>>> make it equally easy to alert on metrics by-tag or over tags.
>> So
>>>>>>>>> my
>>>>>>>>>>>>> experience doesn't say that this is a use case. Were you
>>> thinking
>>>>>>>>> of
>>>>>>>>>> an
>>>>>>>>>>>>> alerting system that makes such a pre-aggregation valuable?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks again,
>>>>>>>>>>>>> -John
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, Mar 29, 2018 at 5:24 PM, Guozhang Wang <
>>>>>>>>> wangguoz@gmail.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hello John,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks for the KIP. Some comments:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 1. Could you list all the possible values of the "reason"
>> tag?
>>>>>>>>> In
>>>>>>>>>> the
>>>>>>>>>>>>> JIRA
>>>>>>>>>>>>>> ticket I left some potential reasons but I'm not clear if
>>> you're
>>>>>>>>>>> going
>>>>>>>>>>>> to
>>>>>>>>>>>>>> categorize each of them as a separate reason, or is there any
>>>>>>>>>>>> additional
>>>>>>>>>>>>>> ones you have in mind.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Also I'm wondering if we should add another metric that does
>> not
>>>>>>>>> have
>>>>>>>>>>> the
>>>>>>>>>>>>>> reason tag but aggregates among all possible reasons? This is
>>>>>>>>> for
>>>>>>>>>>> users
>>>>>>>>>>>>> to
>>>>>>>>>>>>>> easily set their alerting notifications (otherwise they have
>> to
>>>>>>>>>> write
>>>>>>>>>>>> one
>>>>>>>>>>>>>> notification rule per reason) in their monitoring systems.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 2. Note that the processor-node metrics are actually
>>> "per-thread,
>>>>>>>>>>>>> per-task,
>>>>>>>>>>>>>> per-processor-node", and today we only set the per-thread
>>>>>>>>> metrics
>>>>>>>>>> as
>>>>>>>>>>>> INFO
>>>>>>>>>>>>>> while leaving the lower two layers as DEBUG. I agree with
>> your
>>>>>>>>>>> argument
>>>>>>>>>>>>>> that we are missing the per-client roll-up metrics today, but
>>>>>>>>> I'm
>>>>>>>>>>>>> convinced
>>>>>>>>>>>>>> that the right way to approach it would be
>>>>>>>>>>> "just-providing-the-lowest-
>>>>>>>>>>>>>> level
>>>>>>>>>>>>>> metrics only".
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Note that the recording implementations of these three levels are
>>>>>>>>>> different
>>>>>>>>>>>>>> internally today: we did not just do the rolling up to
>> generate
>>>>>>>>> the
>>>>>>>>>>>>>> higher-level metrics from the lower level ones, but we just
>>>>>>>>> record
>>>>>>>>>>> them
>>>>>>>>>>>>>> separately, which means that, if we turn on multiple levels
>> of
>>>>>>>>>>> metrics,
>>>>>>>>>>>>> we
>>>>>>>>>>>>>> may end up collecting some metrics twice. One can argue that
>> is
>>>>>>>>> not
>>>>>>>>>>> the
>>>>>>>>>>>>> best
>>>>>>>>>>>>>> way to represent multi-level metrics collecting and
>> reporting,
>>>>>>>>> but
>>>>>>>>>> by
>>>>>>>>>>>>> only
>>>>>>>>>>>>>> enabling thread-level metrics as INFO today, that
>>> implementation
>>>>>>>>>>> could
>>>>>>>>>>>> be
>>>>>>>>>>>>>> more efficient than only collecting the metrics at the lowest
>>>>>>>>>> level,
>>>>>>>>>>>> and
>>>>>>>>>>>>>> then doing the roll-up calculations outside of the metrics
>>> classes.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Plus, today not all processor-nodes may possibly skip
>> records,
>>>>>>>>>> AFAIK
>>>>>>>>>>> we
>>>>>>>>>>>>>> will only skip records at the source, sink, window and
>>>>>>>>> aggregation
>>>>>>>>>>>>>> processor nodes, so adding a metric per processor looks like
>> an
>>>>>>>>>>>> overkill
>>>>>>>>>>>>> to
>>>>>>>>>>>>>> me as well. On the other hand, from user's perspective the
>>>>>>>>> "reason"
>>>>>>>>>>> tag
>>>>>>>>>>>>> may
>>>>>>>>>>>>>> be sufficient for them to narrow down where inside the
>> topology
>>>>>>>>> is
>>>>>>>>>>>>> causing
>>>>>>>>>>>>>> records to be dropped on the floor. So I think the
>> "per-thread,
>>>>>>>>>>>> per-task"
>>>>>>>>>>>>>> level metrics should be sufficient for them to troubleshoot
>> in
>>>>>>>>>> DEBUG
>>>>>>>>>>>>> mode,
>>>>>>>>>>>>>> and we can add another "per-thread" level metrics as INFO
>> which
>>>>>>>>> is
>>>>>>>>>>>> turned
>>>>>>>>>>>>>> on by default. So under normal execution users still only
>> need
>>>>>>>>> INFO
>>>>>>>>>>>> level
>>>>>>>>>>>>>> metrics for alerting (e.g. set alerts on all skipped-records
>>>>>>>>>> metrics
>>>>>>>>>>> as
>>>>>>>>>>>>>> non-zero), and then upon troubleshooting they can turn on
>>> DEBUG
>>>>>>>>>>>> metrics
>>>>>>>>>>>>> to
>>>>>>>>>>>>>> look into which task is actually causing the skipped records.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Guozhang
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, Mar 29, 2018 at 2:03 PM, Matthias J. Sax <
>>>>>>>>>>>> matthias@confluent.io>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks for the KIP John.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Reading the material on the related Jiras, I am wondering
>> what
>>>>>>>>>>>> `reason`
>>>>>>>>>>>>>>> tags you want to introduce? Can you elaborate? The KIP
>> should
>>>>>>>>>> list
>>>>>>>>>>>>> those
>>>>>>>>>>>>>>> IMHO.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> About the fine grained metrics vs the roll-up: you say that
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> the coarse metric aggregates across two dimensions
>>>>>>>>>> simultaneously
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Can you elaborate why this is an issue? I am not convinced
>> atm
>>>>>>>>>> that
>>>>>>>>>>>> we
>>>>>>>>>>>>>>> should put the fine grained metrics into INFO level and
>> remove
>>>>>>>>>> the
>>>>>>>>>>>>>>> roll-up at thread level.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Given that they have to do this sum to get a usable
>>>>>>>>> top-level
>>>>>>>>>>> view
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> This is a fair concern, but I don't share the conclusion.
>>>>>>>>>> Offering
>>>>>>>>>>> a
>>>>>>>>>>>>>>> built-in `KafkaStreams` "client" roll-up out of the box
>> might
>>>>>>>>> be
>>>>>>>>>> a
>>>>>>>>>>>>>>> better solution. In the past we did not offer this due to
>>>>>>>>>>> performance
>>>>>>>>>>>>>>> concerns, but we could allow an "opt-in" mechanism. If you
>>>>>>>>>>> disagree,
>>>>>>>>>>>>> can
>>>>>>>>>>>>>>> you provide some reasoning and add them to the "Rejected
>>>>>>>>>>>> alternatives"
>>>>>>>>>>>>>>> section.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> To rephrase: I understand the issue about missing top-level
>>>>>>>>> view,
>>>>>>>>>>> but
>>>>>>>>>>>>>>> instead of going more fine grained, we should consider
>> adding
>>>>>>>>>> this
>>>>>>>>>>>>>>> top-level view and add/keep the fine grained metrics at
>> DEBUG
>>>>>>>>>> level
>>>>>>>>>>>>> only
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I am +1 to add TopologyTestDriver#metrics() and to remove
>> old
>>>>>>>>>>> metrics
>>>>>>>>>>>>>>> directly as you suggested.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> -Matthias
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 3/28/18 6:42 PM, Ted Yu wrote:
>>>>>>>>>>>>>>>> Looks good to me.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Wed, Mar 28, 2018 at 3:11 PM, John Roesler <
>>>>>>>>>> john@confluent.io
>>>>>>>>>>>>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hello all,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I am proposing KIP-274 to improve the metrics around
>>>>>>>>> skipped
>>>>>>>>>>>> records
>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>> Streams.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Please find the details here:
>>>>>>>>>>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-
>>>>>>>>>>>>>>>>> 274%3A+Kafka+Streams+Skipped+Records+Metrics
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Please let me know what you think!
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>> -John
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> -- Guozhang
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> -- Guozhang
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> -- Guozhang
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>
> 
> 
> 


Re: [DISCUSS] KIP-274: Kafka Streams Skipped Records Metrics

Posted by John Roesler <jo...@confluent.io>.
Oh, sorry, I missed the point.

Yeah, we can totally do that. The reason to move it to the task level was
mainly to make it available for the metrics in TopologyTestDriver as well.
But if we decide that's a non-goal, then there's no motivation to change it.

And actually that reminds me that we do have an open question about whether
I should add a metrics getter to the TopologyTestDriver's interface. WDYT?
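For illustration only — here is a rough sketch of the kind of user-side roll-up such a getter would enable. The map shape and metric names below are simplifications I am making up for this example, not the real MetricName scheme:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical client-side roll-up: if TopologyTestDriver (or KafkaStreams)
// exposed its metrics as a map, a user could sum the fine-grained per-task
// skipped-record counts into a single top-level number themselves.
public class SkippedRecordsRollup {

    // Sum every entry whose name marks it as a skipped-records count,
    // ignoring unrelated metrics.
    public static long totalSkips(Map<String, Long> metrics) {
        return metrics.entrySet().stream()
                .filter(e -> e.getKey().endsWith("skipped-records"))
                .mapToLong(Map.Entry::getValue)
                .sum();
    }

    public static void main(String[] args) {
        Map<String, Long> metrics = new LinkedHashMap<>();
        metrics.put("task-0_0/skipped-records", 3L);
        metrics.put("task-0_1/skipped-records", 1L);
        metrics.put("task-0_0/process-rate", 42L); // unrelated metric, ignored
        System.out.println(totalSkips(metrics)); // prints 4
    }
}
```

This is the aggregation users would otherwise have to do by hand if we only expose task-level metrics.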

Thanks,
-John

On Tue, Apr 3, 2018 at 1:26 PM, Guozhang Wang <wa...@gmail.com> wrote:

> I think Matthias' comment is that, we can still record the metrics on the
> thread-level, while having the WARN log entry include sufficient context
> information so that users can still easily narrow down the investigation
> scope.
>
>
> Guozhang
>
> On Tue, Apr 3, 2018 at 11:22 AM, John Roesler <jo...@confluent.io> wrote:
>
> > I agree we should add as much information as is reasonable to the log.
> For
> > example, see this WIP PR I started for this KIP:
> >
> > https://github.com/apache/kafka/pull/4812/files#diff-
> > 88d129f048bc842c7db5b2566a45fce8R80
> >
> > and
> >
> > https://github.com/apache/kafka/pull/4812/files#diff-
> > 69e6789eb675ec978a1abd24fed96eb1R111
> >
> > I'm not sure if we should nail down the log messages in the KIP or in the
> > PR discussion. What say you?
> >
> > Thanks,
> > -John
> >
> > On Tue, Apr 3, 2018 at 12:20 AM, Matthias J. Sax <ma...@confluent.io>
> > wrote:
> >
> > > Thanks for sharing your thoughts. As I mentioned originally, I am not
> > > sure about the right log level either. Your arguments are convincing --
> > > thus, I am fine with keeping WARN level.
> > >
> > > The task vs thread level argument is an interesting one. However, I am
> > > wondering if we should add this information into the corresponding WARN
> > > logs that we write anyway? For this case, we can also log the
> > > corresponding operator (and other information like topic name etc if
> > > needed). WDYT about this?
> > >
> > >
> > > -Matthias
> > >
> > > On 4/2/18 8:31 PM, Guozhang Wang wrote:
> > > > Regarding logging: I'm inclined to keep logging at WARN level since
> > > skipped
> > > > records are not expected in normal execution (for all reasons that we
> > are
> > > > aware of), and hence when errors happen users should be alerted from
> > > > metrics and look into the log files, so to me if it is really
> > spamming
> > > > the log files it is also a good alert for users. Besides for
> > deserialize
> > > > errors we already log at WARN level for this reason.
> > > >
> > > > Regarding the metrics-levels: I was pondering on that as well. What
> > made
> > > me
> > > > to think and agree on task-level rather than thread-level is that for some
> > > reasons
> > > > like window retention, they may possibly be happening on a subset of
> > > input
> > > > partitions, and since tasks are correlated with partitions, the task-level
> > > metrics
> > > > can help users to narrow down on the specific input data partitions.
> > > >
> > > >
> > > > Guozhang
> > > >
> > > >
> > > > On Mon, Apr 2, 2018 at 6:43 PM, John Roesler <jo...@confluent.io>
> > wrote:
> > > >
> > > >> Hi Matthias,
> > > >>
> > > >> No worries! Thanks for the reply.
> > > >>
> > > >> 1) There isn't a connection. I tried using the TopologyTestDriver to
> > > write
> > > >> a quick test exercising the current behavior and discovered that the
> > > >> metrics weren't available. It seemed like they should be, so I
> tacked
> > > it on
> > > >> to this KIP. If you feel it's inappropriate, I can pull it back out.
> > > >>
> > > >> 2) I was also concerned about that, but I figured it would come up
> in
> > > >> discussion if I just went ahead and proposed it. And here we are!
> > > >>
> > > >> Here's my thought: maybe there are two classes of skips:
> "controlled"
> > > and
> > > >> "uncontrolled", where "controlled" means, as an app author, I
> > > deliberately
> > > >> filter out some events, and "uncontrolled" means that I simply don't
> > > >> account for some feature of the data, and the framework skips them
> (as
> > > >> opposed to crashing).
> > > >>
> > > >> In this breakdowns, the skips I'm adding metrics for are all
> > > uncontrolled
> > > >> skips (and we hope to measure all the uncontrolled skips). Our skips
> > are
> > > >> well documented, so it wouldn't be terrible to have an application
> in
> > > which
> > > >> you know you expect to have tons of uncontrolled skips, but it's not
> > > great
> > > >> either, since you may also have some *unexpected* uncontrolled
> skips.
> > > It'll
> > > >> be difficult to notice, since you're probably not alerting on the
> > metric
> > > >> and filtering out the logs (whatever their level).
> > > >>
> > > >> I'd recommend any app author, as an alternative, to convert all
> > expected
> > > >> skips to controlled ones, by updating the topology to filter those
> > > records
> > > >> out.
> > > >>
> > > >> Following from my recommendation, as a library author, I'm inclined
> to
> > > mark
> > > >> those logs WARN, since in my opinion, they should be concerning to
> the
> > > app
> > > >> authors. I'd definitely want to show, rather than hide, them by
> > > default, so
> > > >> I would pick INFO at least.
> > > >>
> > > >> That said, logging is always a tricky issue for lower-level
> libraries
> > > that
> > > >> run inside user code, since we don't have all the information we
> need
> > to
> > > >> make the right call.
> > > >>
> > > >>
> > > >>
> > > >> On your last note, yeah, I got that impression from Guozhang as
> well.
> > > >> Thanks for the clarification.
> > > >>
> > > >> -John
> > > >>
> > > >>
> > > >>
> > > >> On Mon, Apr 2, 2018 at 4:03 PM, Matthias J. Sax <
> > matthias@confluent.io>
> > > >> wrote:
> > > >>
> > > >>> John,
> > > >>>
> > > >>> sorry for my late reply and thanks for updating the KIP.
> > > >>>
> > > >>> I like your approach about "metrics are for monitoring, logs are
> for
> > > >>> debugging" -- however:
> > > >>>
> > > >>> 1) I don't see a connection between this and the task-level metrics
> > > that
> > > >>> you propose to get the metrics in `TopologyTestDriver`. I don't
> think
> > > >>> people would monitor the `TopologyTestDriver` and thus am wondering why
> > it
> > > >>> is important to include the metrics there? Thread-level metric
> might
> > be
> > > >>> easier to monitor though (ie, less different metric to monitor).
> > > >>>
> > > >>> 2) I am a little worried about WARN level logging and that it might
> > be
> > > >>> too chatty -- as you pointed out, it's about debugging, thus DEBUG
> > > level
> > > >>> might be better. Not 100% sure about this to be honest. What is the
> > > >>> general assumption about the frequency for skipped records? I could
> > > >>> imagine cases for which skipped records are quite frequent and
> thus,
> > > >>> WARN level logs might "flood" the logs
> > > >>>
> > > >>> One final remark:
> > > >>>
> > > >>>> More
> > > >>>> generally, I would like to establish a pattern in which we could
> add
> > > >> new
> > > >>>> values for the "reason" tags without needing a KIP to do so.
> > > >>>
> > > >>> From my understanding, this is not feasible. Changing metrics is
> > always
> > > >>> considered a public API change, and we need a KIP for any change.
> As
> > we
> > > >>> moved away from tagging, it doesn't matter for the KIP anymore --
> > just
> > > >>> wanted to point it out.
> > > >>>
> > > >>>
> > > >>> -Matthias
> > > >>>
> > > >>>
> > > >>> On 3/30/18 2:47 PM, John Roesler wrote:
> > > >>>> Allrighty! The KIP is updated.
> > > >>>>
> > > >>>> Thanks again, all, for the feedback.
> > > >>>> -John
> > > >>>>
> > > >>>> On Fri, Mar 30, 2018 at 3:35 PM, John Roesler <jo...@confluent.io>
> > > >> wrote:
> > > >>>>
> > > >>>>> Hey Guozhang and Bill,
> > > >>>>>
> > > >>>>> Ok, I'll update the KIP. At the risk of disturbing consensus, I'd
> > > like
> > > >>> to
> > > >>>>> put it in the task instead of the thread so that it'll show up in
> > the
> > > >>>>> TopologyTestDriver metrics as well.
> > > >>>>>
> > > >>>>> I'm leaning toward keeping the scope where it is right now, but
> if
> > > >>> others
> > > >>>>> want to advocate for tossing in some more metrics, we can go that
> > > >> route.
> > > >>>>>
> > > >>>>> Thanks all,
> > > >>>>> -John
> > > >>>>>
> > > >>>>> On Fri, Mar 30, 2018 at 2:37 PM, Bill Bejeck <bb...@gmail.com>
> > > >> wrote:
> > > >>>>>
> > > >>>>>> Thanks for the KIP John, and sorry for the late comments.
> > > >>>>>>
> > > >>>>>> I'm on the fence about providing a single level of metrics, but I
> > think
> > > >>> we'll
> > > >>>>>> have that discussion outside of this KIP.
> > > >>>>>>
> > > >>>>>>> * maintain one skipped-record metric (could be per-thread,
> > > per-task,
> > > >>> or
> > > >>>>>>> per-processor-node) with no "reason"
> > > >>>>>>> * introduce a warn-level log detailing the
> topic/partition/offset
> > > >> and
> > > >>>>>>> reason of the skipped record
> > > >>>>>>
> > > >>>>>> I'm +1 on both of these suggestions.
> > > >>>>>>
> > > >>>>>> Finally, we have had requests in the past for some metrics
> around
> > > >> when
> > > >>>>>> a persistent store removes an expired window. Would adding that
> to
> > > our
> > > >>>>>> metrics stretch the scope of this KIP too much?
> > > >>>>>>
> > > >>>>>> Thanks again and overall I'm +1 on this KIP
> > > >>>>>>
> > > >>>>>> Bill
> > > >>>>>>
> > > >>>>>> On Fri, Mar 30, 2018 at 2:00 PM, Guozhang Wang <
> > wangguoz@gmail.com>
> > > >>>>>> wrote:
> > > >>>>>>
> > > >>>>>>> The proposal sounds good to me. About "maintain only one level
> of
> > > >>>>>> metrics"
> > > >>>>>>> maybe we can discuss that separately from this KIP since
> > that
> > > >>>>>> would
> > > >>>>>>> be a larger scope of discussion. I agree that if we are going
> to
> > > >>>>>> maintain
> > > >>>>>>> only one-level metrics it should be lowest level and we would
> let
> > > >>> users
> > > >>>>>> to
> > > >>>>>>> do the roll-ups themselves, but I'm still not fully convinced
> > that
> > > >> we
> > > >>>>>>> should just provide single-level metrics, because 1) I think
> for
> > > >>>>>> different
> > > >>>>>>> metrics people may be interested in investigating different
> > > >>>>>>> granularities, e.g. for poll / commit rate these are at the
> > lowest
> > > >>>>>>> task-level metrics, while for process-rate / skip-rate they can
> > be
> > > >> as
> > > >>>>>> low
> > > >>>>>>> as processor-node metrics, and 2) user-side roll-ups may not
> > be
> > > >>> very
> > > >>>>>>> straightforward. But for 2) if someone can provide an
> efficient
> > > and
> > > >>>>>> easy
> > > >>>>>>> implementation of that I can be persuaded :)
> > > >>>>>>>
> > > >>>>>>> For now I'm thinking we can add the metric on thread-level,
> > either
> > > >>> with
> > > >>>>>>> finer grained ones with "reason" tag plus an aggregated one
> > without
> > > >>> the
> > > >>>>>>> tag, or just having a single aggregated metric without the tag
> > > looks
> > > >>>>>> good
> > > >>>>>>> to me.
> > > >>>>>>>
> > > >>>>>>>
> > > >>>>>>> Guozhang
> > > >>>>>>>
> > > >>>>>>> On Fri, Mar 30, 2018 at 8:05 AM, John Roesler <
> john@confluent.io
> > >
> > > >>>>>> wrote:
> > > >>>>>>>
> > > >>>>>>>> Hey Guozhang,
> > > >>>>>>>>
> > > >>>>>>>> Thanks for the reply. Regarding JMX, I can dig it. I'll
> provide
> > a
> > > >>>>>> list in
> > > >>>>>>>> the KIP. I was also thinking we'd better start a documentation
> > > page
> > > >>>>>> with
> > > >>>>>>>> the metrics listed.
> > > >>>>>>>>
> > > >>>>>>>> I'd have no problem logging a warning when we skip records. On
> > the
> > > >>>>>> metric
> > > >>>>>>>> front, really I'm just pushing for us to maintain only one
> level
> > > of
> > > >>>>>>>> metrics. If that's more or less granular (i.e., maybe we don't
> > > >> have a
> > > >>>>>>>> metric per reason and log the reason instead), that's fine by
> > me.
> > > I
> > > >>>>>> just
> > > >>>>>>>> don't think it provides a lot of extra value per complexity
> > > >>> (interface
> > > >>>>>>> and
> > > >>>>>>>> implementation) to maintain roll-ups at the thread level in
> > > >> addition
> > > >>>>>> to
> > > >>>>>>>> lower-level metrics.
> > > >>>>>>>>
> > > >>>>>>>> How about this instead:
> > > >>>>>>>> * maintain one skipped-record metric (could be per-thread,
> > > >> per-task,
> > > >>>>>> or
> > > >>>>>>>> per-processor-node) with no "reason"
> > > >>>>>>>> * introduce a warn-level log detailing the
> > topic/partition/offset
> > > >> and
> > > >>>>>>>> reason of the skipped record
> > > >>>>>>>>
> > > >>>>>>>> If you like that, I can update the KIP.
> > > >>>>>>>>
> > > >>>>>>>> Thanks,
> > > >>>>>>>> -John
> > > >>>>>>>>
> > > >>>>>>>>
> > > >>>>>>>>
> > > >>>>>>>> On Thu, Mar 29, 2018 at 6:22 PM, Guozhang Wang <
> > > wangguoz@gmail.com
> > > >>>
> > > >>>>>>> wrote:
> > > >>>>>>>>
> > > >>>>>>>>>> One thing you mention is the notion of setting alerts on
> > coarser
> > > >>>>>>>> metrics
> > > >>>>>>>>> being easier than finer ones. All the metric alerting
> systems I
> > > >> have
> > > >>>>>>> used
> > > >>>>>>>>> make it equally easy to alert on metrics by-tag or over tags.
> > So
> > > >> my
> > > >>>>>>>>> experience doesn't say that this is a use case. Were you
> > thinking
> > > >>>>>> of an
> > > >>>>>>>>> alerting system that makes such a pre-aggregation valuable?
> > > >>>>>>>>>
> > > >>>>>>>>> For the commonly used JMX reporter, tags will be encoded
> > directly
> > > >> as
> > > >>>>>>> part
> > > >>>>>>>> of
> > > >>>>>>>>> the object name, and if users want to monitor them they need
> > to
> > > >>>>>> know
> > > >>>>>>>> these
> > > >>>>>>>>> values before hand. That is also why I think we do want to
> list
> > > >> all
> > > >>>>>> the
> > > >>>>>>>>> possible values of the reason tags in the KIP, since
> > > >>>>>>>>>
> > > >>>>>>>>>> In my email in response to Matthias, I gave an example of
> the
> > > >>>>>> kind of
> > > >>>>>>>>> scenario that would lead me as an operator to run with DEBUG
> on
> > > >> all
> > > >>>>>> the
> > > >>>>>>>>> time, since I wouldn't be sure, having seen a skipped record
> > > once,
> > > >>>>>> that
> > > >>>>>>>> it
> > > >>>>>>>>> would ever happen again. The solution is to capture all the
> > > >>>>>> available
> > > >>>>>>>>> information about the reason and location of skips all the
> > time.
> > > >>>>>>>>>
> > > >>>>>>>>> That is a good point. I think we can either expose all levels
> > > >>>>>> of metrics
> > > >>>>>>>> by
> > > >>>>>>>>> default, or only expose the lowest-level metrics and get
> > rid
> > > >> of
> > > >>>>>>> other
> > > >>>>>>>>> levels to let users do roll-ups themselves (which will be a
> > much
> > > >>>>>> larger
> > > >>>>>>>>> scope for discussion), or we can encourage users to not
> purely
> > > >>>>>> depend
> > > >>>>>>> on
> > > >>>>>>>>> metrics for such troubleshooting: that is to say, users would only
> > be
> > > >>>>>>> alerted
> > > >>>>>>>>> based on metrics, and we can log an info / warn log4j entry
> each
> > > >>>>>> time we
> > > >>>>>>>> are
> > > >>>>>>>>> about to skip a record all over the places, so that upon
> being
> > > >>>>>> notified
> > > >>>>>>>>> users can look into the logs to find the details on where /
> > when
> > > >> it
> > > >>>>>>>>> happens. WDYT?
> > > >>>>>>>>>
> > > >>>>>>>>>
> > > >>>>>>>>> Guozhang
> > > >>>>>>>>>
> > > >>>>>>>>>
> > > >>>>>>>>>
> > > >>>>>>>>> On Thu, Mar 29, 2018 at 3:57 PM, John Roesler <
> > john@confluent.io
> > > >
> > > >>>>>>> wrote:
> > > >>>>>>>>>
> > > >>>>>>>>>> Hey Guozhang,
> > > >>>>>>>>>>
> > > >>>>>>>>>> Thanks for the review.
> > > >>>>>>>>>>
> > > >>>>>>>>>> 1.
> > > >>>>>>>>>> Matthias raised the same question about the "reason" tag
> > values.
> > > >> I
> > > >>>>>>> can
> > > >>>>>>>>> list
> > > >>>>>>>>>> all possible values of the "reason" tag, but I'm thinking
> this
> > > >>>>>> level
> > > >>>>>>> of
> > > >>>>>>>>>> detail may not be KIP-worthy, maybe the code and
> documentation
> > > >>>>>> review
> > > >>>>>>>>> would
> > > >>>>>>>>>> be sufficient. If you all disagree and would like it
> included
> > in
> > > >>>>>> the
> > > >>>>>>>>> KIP, I
> > > >>>>>>>>>> can certainly do that.
> > > >>>>>>>>>>
> > > >>>>>>>>>> If we do provide roll-up metrics, I agree with the pattern
> of
> > > >>>>>> keeping
> > > >>>>>>>> the
> > > >>>>>>>>>> same name but eliminating the tags for the dimensions that
> > were
> > > >>>>>>>>> rolled-up.
> > > >>>>>>>>>>
> > > >>>>>>>>>> 2.
> > > >>>>>>>>>> I'm not too sure that implementation efficiency really
> > becomes a
> > > >>>>>>> factor
> > > >>>>>>>>> in
> > > >>>>>>>>>> choosing whether to (by default) update one coarse metric at
> > the
> > > >>>>>>> thread
> > > >>>>>>>>>> level or one granular metric at the processor-node level,
> > since
> > > >>>>>> it's
> > > >>>>>>>> just
> > > >>>>>>>>>> one metric being updated either way. I do agree that if we
> > were
> > > >> to
> > > >>>>>>>> update
> > > >>>>>>>>>> the granular metrics and multiple roll-ups, then we should
> > > >>>>>> consider
> > > >>>>>>> the
> > > >>>>>>>>>> efficiency.
> > > >>>>>>>>>>
> > > >>>>>>>>>> I agree it's probably not necessary to surface the metrics
> for
> > > >> all
> > > >>>>>>>> nodes
> > > >>>>>>>>>> regardless of whether they can or do skip records. Perhaps
> we
> > > can
> > > >>>>>>>> lazily
> > > >>>>>>>>>> register the metrics.
> > > >>>>>>>>>>
> > > >>>>>>>>>> In my email in response to Matthias, I gave an example of
> the
> > > >>>>>> kind of
> > > >>>>>>>>>> scenario that would lead me as an operator to run with DEBUG
> > on
> > > >>>>>> all
> > > >>>>>>> the
> > > >>>>>>>>>> time, since I wouldn't be sure, having seen a skipped record
> > > >> once,
> > > >>>>>>> that
> > > >>>>>>>>> it
> > > >>>>>>>>>> would ever happen again. The solution is to capture all the
> > > >>>>>> available
> > > >>>>>>>>>> information about the reason and location of skips all the
> > time.
> > > >>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>> One thing you mention is the notion of setting alerts on
> > coarser
> > > >>>>>>>> metrics
> > > >>>>>>>>>> being easier than finer ones. All the metric alerting
> systems
> > I
> > > >>>>>> have
> > > >>>>>>>> used
> > > >>>>>>>>>> make it equally easy to alert on metrics by-tag or over
> tags.
> > So
> > > >>>>>> my
> > > >>>>>>>>>> experience doesn't say that this is a use case. Were you
> > > thinking
> > > >>>>>> of
> > > >>>>>>> an
> > > >>>>>>>>>> alerting system that makes such a pre-aggregation valuable?
> > > >>>>>>>>>>
> > > >>>>>>>>>> Thanks again,
> > > >>>>>>>>>> -John
> > > >>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>> On Thu, Mar 29, 2018 at 5:24 PM, Guozhang Wang <
> > > >>>>>> wangguoz@gmail.com>
> > > >>>>>>>>> wrote:
> > > >>>>>>>>>>
> > > >>>>>>>>>>> Hello John,
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> Thanks for the KIP. Some comments:
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> 1. Could you list all the possible values of the "reason"
> > tag?
> > > >>>>>> In
> > > >>>>>>> the
> > > >>>>>>>>>> JIRA
> > > >>>>>>>>>>> ticket I left some potential reasons but I'm not clear if
> > > you're
> > > >>>>>>>> going
> > > >>>>>>>>> to
> > > >>>>>>>>>>> categorize each of them as a separate reason, or is there
> any
> > > >>>>>>>>> additional
> > > >>>>>>>>>>> ones you have in mind.
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> Also I'm wondering if we should add another metric that does
> > not
> > > >>>>>> have
> > > >>>>>>>> the
> > > >>>>>>>>>>> reason tag but aggregates among all possible reasons? This
> is
> > > >>>>>> for
> > > >>>>>>>> users
> > > >>>>>>>>>> to
> > > >>>>>>>>>>> easily set their alerting notifications (otherwise they
> have
> > to
> > > >>>>>>> write
> > > >>>>>>>>> one
> > > >>>>>>>>>>> notification rule per reason) in their monitoring systems.
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> 2. Note that the processor-node metrics are actually
> > > "per-thread,
> > > >>>>>>>>>> per-task,
> > > >>>>>>>>>>> per-processor-node", and today we only set the per-thread
> > > >>>>>> metrics
> > > >>>>>>> as
> > > >>>>>>>>> INFO
> > > >>>>>>>>>>> while leaving the lower two layers as DEBUG. I agree with
> > your
> > > >>>>>>>> argument
> > > >>>>>>>>>>> that we are missing the per-client roll-up metrics today,
> but
> > > >>>>>> I'm
> > > >>>>>>>>>> convinced
> > > >>>>>>>>>>> that the right way to approach it would be
> > > >>>>>>>> "just-providing-the-lowest-
> > > >>>>>>>>>>> level
> > > >>>>>>>>>>> metrics only".
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> Note that the recording implementations of these three levels are
> > > >>>>>>> different
> > > >>>>>>>>>>> internally today: we did not just do the rolling up to
> > generate
> > > >>>>>> the
> > > >>>>>>>>>>> higher-level metrics from the lower level ones, but we just
> > > >>>>>> record
> > > >>>>>>>> them
> > > >>>>>>>>>>> separately, which means that, if we turn on multiple levels
> > of
> > > >>>>>>>> metrics,
> > > >>>>>>>>>> we
> > > >>>>>>>>>>> may end up collecting some metrics twice. One can argue that
> > is
> > > >>>>>> not
> > > >>>>>>>> the
> > > >>>>>>>>>> best
> > > >>>>>>>>>>> way to represent multi-level metrics collecting and
> > reporting,
> > > >>>>>> but
> > > >>>>>>> by
> > > >>>>>>>>>> only
> > > >>>>>>>>>>> enabling thread-level metrics as INFO today, that
> > > implementation
> > > >>>>>>>> could
> > > >>>>>>>>> be
> > > >>>>>>>>>>> more efficient than only collecting the metrics at the
> lowest
> > > >>>>>>> level,
> > > >>>>>>>>> and
> > > >>>>>>>>>>> then doing the roll-up calculations outside of the metrics
> > > classes.
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> Plus, today not all processor-nodes may possibly skip
> > records,
> > > >>>>>>> AFAIK
> > > >>>>>>>> we
> > > >>>>>>>>>>> will only skip records at the source, sink, window and
> > > >>>>>> aggregation
> > > >>>>>>>>>>> processor nodes, so adding a metric per processor looks
> like
> > an
> > > >>>>>>>>> overkill
> > > >>>>>>>>>> to
> > > >>>>>>>>>>> me as well. On the other hand, from user's perspective the
> > > >>>>>> "reason"
> > > >>>>>>>> tag
> > > >>>>>>>>>> may
> > > >>>>>>>>>>> be sufficient for them to narrow down where inside the
> > topology
> > > >>>>>> is
> > > >>>>>>>>>> causing
> > > >>>>>>>>>>> records to be dropped on the floor. So I think the
> > "per-thread,
> > > >>>>>>>>> per-task"
> > > >>>>>>>>>>> level metrics should be sufficient for them to troubleshoot
> > in
> > > >>>>>>> DEBUG
> > > >>>>>>>>>> mode,
> > > >>>>>>>>>>> and we can add another "per-thread" level metrics as INFO
> > which
> > > >>>>>> is
> > > >>>>>>>>> turned
> > > >>>>>>>>>>> on by default. So under normal execution users still only
> > need
> > > >>>>>> INFO
> > > >>>>>>>>> level
> > > >>>>>>>>>>> metrics for alerting (e.g. set alerts on all
> skipped-records
> > > >>>>>>> metrics
> > > >>>>>>>> as
> > > >>>>>>>>>>> non-zero), and then upon troubleshooting they can turn on
> > > DEBUG
> > > >>>>>>>>> metrics
> > > >>>>>>>>>> to
> > > >>>>>>>>>>> look into which task is actually causing the skipped
> records.
> > > >>>>>>>>>>>
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> Guozhang
> > > >>>>>>>>>>>
> > > >>>>>>>>>>>
> > > >>>>>>>>>>>
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> On Thu, Mar 29, 2018 at 2:03 PM, Matthias J. Sax <
> > > >>>>>>>>> matthias@confluent.io>
> > > >>>>>>>>>>> wrote:
> > > >>>>>>>>>>>
> > > >>>>>>>>>>>> Thanks for the KIP John.
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> Reading the material on the related Jiras, I am wondering
> > what
> > > >>>>>>>>> `reason`
> > > >>>>>>>>>>>> tags you want to introduce? Can you elaborate? The KIP
> > should
> > > >>>>>>> list
> > > >>>>>>>>>> those
> > > >>>>>>>>>>>> IMHO.
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> About the fine grained metrics vs the roll-up: you say
> that
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>>> the coarse metric aggregates across two dimensions
> > > >>>>>>> simultaneously
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> Can you elaborate why this is an issue? I am not convinced
> > atm
> > > >>>>>>> that
> > > >>>>>>>>> we
> > > >>>>>>>>>>>> should put the fine grained metrics into INFO level and
> > remove
> > > >>>>>>> the
> > > >>>>>>>>>>>> roll-up at thread level.
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>>> Given that they have to do this sum to get a usable
> > > >>>>>> top-level
> > > >>>>>>>> view
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> This is a fair concern, but I don't share the conclusion.
> > > >>>>>>> Offering
> > > >>>>>>>> a
> > > >>>>>>>>>>>> built-in `KafkaStreams` "client" roll-up out of the box
> > might
> > > >>>>>> be
> > > >>>>>>> a
> > > >>>>>>>>>>>> better solution. In the past we did not offer this due to
> > > >>>>>>>> performance
> > > >>>>>>>>>>>> concerns, but we could allow an "opt-in" mechanism. If you
> > > >>>>>>>> disagree,
> > > >>>>>>>>>> can
> > > >>>>>>>>>>>> you provide some reasoning and add them to the "Rejected
> > > >>>>>>>>> alternatives"
> > > >>>>>>>>>>>> section.
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> To rephrase: I understand the issue about missing
> top-level
> > > >>>>>> view,
> > > >>>>>>>> but
> > > >>>>>>>>>>>> instead of going more fine grained, we should consider to
> > add
> > > >>>>>>> this
> > > >>>>>>>>>>>> top-level view and add/keep the fine grained metrics at
> > DEBUG
> > > >>>>>>> level
> > > >>>>>>>>>> only
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> I am +1 to add TopologyTestDriver#metrics() and to remove
> > old
> > > >>>>>>>> metrics
> > > >>>>>>>>>>>> directly as you suggested.
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> -Matthias
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> On 3/28/18 6:42 PM, Ted Yu wrote:
> > > >>>>>>>>>>>>> Looks good to me.
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> On Wed, Mar 28, 2018 at 3:11 PM, John Roesler <
> > > >>>>>>> john@confluent.io
> > > >>>>>>>>>
> > > >>>>>>>>>>> wrote:
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> Hello all,
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> I am proposing KIP-274 to improve the metrics around
> > > >>>>>> skipped
> > > >>>>>>>>> records
> > > >>>>>>>>>>> in
> > > >>>>>>>>>>>>>> Streams.
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> Please find the details here:
> > > >>>>>>>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> > > >>>>>>>>>>>>>> 274%3A+Kafka+Streams+Skipped+Records+Metrics
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> Please let me know what you think!
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> Thanks,
> > > >>>>>>>>>>>>>> -John
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> --
> > > >>>>>>>>>>> -- Guozhang
> > > >>>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>
> > > >>>>>>>>>
> > > >>>>>>>>>
> > > >>>>>>>>> --
> > > >>>>>>>>> -- Guozhang
> > > >>>>>>>>>
> > > >>>>>>>>
> > > >>>>>>>
> > > >>>>>>>
> > > >>>>>>>
> > > >>>>>>> --
> > > >>>>>>> -- Guozhang
> > > >>>>>>>
> > > >>>>>>
> > > >>>>>
> > > >>>>>
> > > >>>>
> > > >>>
> > > >>>
> > > >>
> > > >
> > > >
> > > >
> > >
> > >
> >
>
>
>
> --
> -- Guozhang
>

Re: [DISCUSS] KIP-274: Kafka Streams Skipped Records Metrics

Posted by Guozhang Wang <wa...@gmail.com>.
I think Matthias' comment is that we can still record the metrics at the
thread level, while having the WARN log entry include sufficient context
information so that users can still easily narrow down the investigation
scope.
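To illustrate the idea (this is a hedged sketch, not the actual Streams implementation -- the helper name and message format below are hypothetical): the thread-level metric only counts the skip, while the WARN entry carries the reason, topic, partition, and offset needed to narrow down the investigation.

```java
// Hypothetical sketch of a context-rich skipped-record warning. The
// single thread-level metric counts the skip; this message supplies the
// debugging context that a coarse metric cannot.
public class SkippedRecordLogSketch {

    static String skipMessage(String reason, String topic, int partition, long offset) {
        return String.format(
            "Skipping record due to %s. topic=[%s] partition=[%d] offset=[%d]",
            reason, topic, partition, offset);
    }

    public static void main(String[] args) {
        // In real code this string would be passed to the logger at WARN level.
        System.out.println(skipMessage("negative timestamp", "my-input", 2, 12345L));
    }
}
```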


Guozhang

On Tue, Apr 3, 2018 at 11:22 AM, John Roesler <jo...@confluent.io> wrote:

> I agree we should add as much information as is reasonable to the log. For
> example, see this WIP PR I started for this KIP:
>
> https://github.com/apache/kafka/pull/4812/files#diff-88d129f048bc842c7db5b2566a45fce8R80
>
> and
>
> https://github.com/apache/kafka/pull/4812/files#diff-69e6789eb675ec978a1abd24fed96eb1R111
>
> I'm not sure if we should nail down the log messages in the KIP or in the
> PR discussion. What say you?
>
> Thanks,
> -John
>
> On Tue, Apr 3, 2018 at 12:20 AM, Matthias J. Sax <ma...@confluent.io>
> wrote:
>
> > Thanks for sharing your thoughts. As I mentioned originally, I am not
> > sure about the right log level either. Your arguments are convincing --
> > thus, I am fine with keeping WARN level.
> >
> > The task vs thread level argument is an interesting one. However, I am
> > wondering if we should add this information into the corresponding WARN
> > logs that we write anyway? For this case, we can also log the
> > corresponding operator (and other information like topic name etc if
> > needed). WDYT about this?
> >
> >
> > -Matthias
> >
> > On 4/2/18 8:31 PM, Guozhang Wang wrote:
> > > Regarding logging: I'm inclined to keep logging at WARN level since
> > > skipped records are not expected in normal execution (for all reasons
> > > that we are aware of); hence, when errors happen, users should be
> > > alerted by the metrics and look into the log files. So to me, if it is
> > > really spamming the log files, that is also a good alert for users.
> > > Besides, for deserialization errors we already log at WARN level for
> > > this reason.
> > >
> > > Regarding the metrics levels: I was pondering on that as well. What
> > > made me think and agree on task-level rather than thread-level is that
> > > for some reasons, like window retention, skips may be happening on
> > > only a subset of input partitions, and since tasks are correlated with
> > > partitions, the task-level metrics can help users narrow down the
> > > specific input data partitions.
> > >
> > >
> > > Guozhang
> > >
> > >
> > > On Mon, Apr 2, 2018 at 6:43 PM, John Roesler <jo...@confluent.io>
> wrote:
> > >
> > >> Hi Matthias,
> > >>
> > >> No worries! Thanks for the reply.
> > >>
> > >> 1) There isn't a connection. I tried using the TopologyTestDriver to
> > write
> > >> a quick test exercising the current behavior and discovered that the
> > >> metrics weren't available. It seemed like they should be, so I tacked
> > it on
> > >> to this KIP. If you feel it's inappropriate, I can pull it back out.
> > >>
> > >> 2) I was also concerned about that, but I figured it would come up in
> > >> discussion if I just went ahead and proposed it. And here we are!
> > >>
> > >> Here's my thought: maybe there are two classes of skips: "controlled"
> > and
> > >> "uncontrolled", where "controlled" means, as an app author, I
> > deliberately
> > >> filter out some events, and "uncontrolled" means that I simply don't
> > >> account for some feature of the data, and the framework skips them (as
> > >> opposed to crashing).
> > >>
> > >> In this breakdown, the skips I'm adding metrics for are all
> > uncontrolled
> > >> skips (and we hope to measure all the uncontrolled skips). Our skips
> are
> > >> well documented, so it wouldn't be terrible to have an application in
> > which
> > >> you know you expect to have tons of uncontrolled skips, but it's not
> > great
> > >> either, since you may also have some *unexpected* uncontrolled skips.
> > It'll
> > >> be difficult to notice, since you're probably not alerting on the
> metric
> > >> and filtering out the logs (whatever their level).
> > >>
> > >> I'd recommend any app author, as an alternative, to convert all
> expected
> > >> skips to controlled ones, by updating the topology to filter those
> > records
> > >> out.
> > >>
> > >> Following from my recommendation, as a library author, I'm inclined to
> > mark
> > >> those logs WARN, since in my opinion, they should be concerning to the
> > app
> > >> authors. I'd definitely want to show, rather than hide, them by
> > default, so
> > >> I would pick INFO at least.
> > >>
> > >> That said, logging is always a tricky issue for lower-level libraries
> > that
> > >> run inside user code, since we don't have all the information we need
> to
> > >> make the right call.
> > >>
> > >>
> > >>
> > >> On your last note, yeah, I got that impression from Guozhang as well.
> > >> Thanks for the clarification.
> > >>
> > >> -John
> > >>
> > >>
> > >>
> > >> On Mon, Apr 2, 2018 at 4:03 PM, Matthias J. Sax <
> matthias@confluent.io>
> > >> wrote:
> > >>
> > >>> John,
> > >>>
> > >>> sorry for my late reply and thanks for updating the KIP.
> > >>>
> > >>> I like your approach about "metrics are for monitoring, logs are for
> > >>> debugging" -- however:
> > >>>
> > >>> 1) I don't see a connection between this and the task-level metrics
> > that
> > >>> you propose to get the metrics in `TopologyTestDriver`. I don't think
> > >>> people would monitor the `TopologyTestDriver` and thus am wondering why
> it
> > >>> is important to include the metrics there? Thread-level metric might
> be
> > >>> easier to monitor though (i.e., fewer distinct metrics to monitor).
> > >>>
> > >>> 2) I am a little worried about WARN level logging and that it might
> be
> > >>> too chatty -- as you pointed out, it's about debugging, thus DEBUG
> > level
> > >>> might be better. Not 100% sure about this to be honest. What is the
> > >>> general assumption about the frequency for skipped records? I could
> > >>> imagine cases for which skipped records are quite frequent and thus,
> > >>> WARN level logs might "flood" the logs.
> > >>>
> > >>> One final remark:
> > >>>
> > >>>> More
> > >>>> generally, I would like to establish a pattern in which we could add
> > >> new
> > >>>> values for the "reason" tags without needing a KIP to do so.
> > >>>
> > >>> From my understanding, this is not feasible. Changing metrics is
> always
> > >>> considered a public API change, and we need a KIP for any change. As
> we
> > >>> moved away from tagging, it doesn't matter for the KIP anymore --
> just
> > >>> wanted to point it out.
> > >>>
> > >>>
> > >>> -Matthias
> > >>>
> > >>>
> > >>> On 3/30/18 2:47 PM, John Roesler wrote:
> > >>>> Allrighty! The KIP is updated.
> > >>>>
> > >>>> Thanks again, all, for the feedback.
> > >>>> -John
> > >>>>
> > >>>> On Fri, Mar 30, 2018 at 3:35 PM, John Roesler <jo...@confluent.io>
> > >> wrote:
> > >>>>
> > >>>>> Hey Guozhang and Bill,
> > >>>>>
> > >>>>> Ok, I'll update the KIP. At the risk of disturbing consensus, I'd
> > like
> > >>> to
> > >>>>> put it in the task instead of the thread so that it'll show up in
> the
> > >>>>> TopologyTestDriver metrics as well.
> > >>>>>
> > >>>>> I'm leaning toward keeping the scope where it is right now, but if
> > >>> others
> > >>>>> want to advocate for tossing in some more metrics, we can go that
> > >> route.
> > >>>>>
> > >>>>> Thanks all,
> > >>>>> -John
> > >>>>>
> > >>>>> On Fri, Mar 30, 2018 at 2:37 PM, Bill Bejeck <bb...@gmail.com>
> > >> wrote:
> > >>>>>
> > >>>>>> Thanks for the KIP John, and sorry for the late comments.
> > >>>>>>
> > >>>>>> I'm on the fence with providing a single level metrics, but I
> think
> > >>> we'll
> > >>>>>> have that discussion outside of this KIP.
> > >>>>>>
> > >>>>>>> * maintain one skipped-record metric (could be per-thread,
> > per-task,
> > >>> or
> > >>>>>>> per-processor-node) with no "reason"
> > >>>>>>> * introduce a warn-level log detailing the topic/partition/offset
> > >> and
> > >>>>>>> reason of the skipped record
> > >>>>>>
> > >>>>>> I'm +1 on both of these suggestions.
> > >>>>>>
> > >>>>>> Finally, we have had requests in the past for some metrics around
> > >> when
> > >>>>>> persistent store removes an expired window.  Would adding that to
> > our
> > >>>>>> metrics stretch the scope of this KIP too much?
> > >>>>>>
> > >>>>>> Thanks again and overall I'm +1 on this KIP
> > >>>>>>
> > >>>>>> Bill
> > >>>>>>
> > >>>>>> On Fri, Mar 30, 2018 at 2:00 PM, Guozhang Wang <
> wangguoz@gmail.com>
> > >>>>>> wrote:
> > >>>>>>
> > >>>>>>> The proposal sounds good to me. About "maintain only one level of
> > >>>>>> metrics"
> > >>>>>>> maybe we can discuss about that separately from this KIP since
> that
> > >>>>>> would
> > >>>>>>> be a larger scope of discussion. I agree that if we are going to
> > >>>>>> maintain
> > >>>>>>> only one-level metrics it should be lowest level and we would let
> > >>> users
> > >>>>>> to
> > >>>>>>> do the roll-ups themselves, but I'm still not fully convinced
> that
> > >> we
> > >>>>>>> should just provide single-level metrics, because 1) I think for
> > >>>>>> different
> > >>>>>>> metrics people may be interested to investigate into different
> > >>>>>>> granularities, e.g. for poll / commit rate these are at the
> lowest
> > >>>>>>> task-level metrics, while for process-rate / skip-rate they can
> be
> > >> as
> > >>>>>> low
> > >>>>>>> as processor-node metrics, and 2) user-side rolling ups may not
> be
> > >>> very
> > >>>>>>> straight-forward. But for 2) if someone can provide an efficient
> > and
> > >>>>>> easy
> > >>>>>>> implementation of that I can be persuaded :)
> > >>>>>>>
> > >>>>>>> For now I'm thinking we can add the metric on thread-level,
> either
> > >>> with
> > >>>>>>> finer grained ones with "reason" tag plus an aggregated one
> without
> > >>> the
> > >>>>>>> tag, or just having a single aggregated metric without the tag
> > looks
> > >>>>>> good
> > >>>>>>> to me.
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> Guozhang
> > >>>>>>>
> > >>>>>>> On Fri, Mar 30, 2018 at 8:05 AM, John Roesler <john@confluent.io
> >
> > >>>>>> wrote:
> > >>>>>>>
> > >>>>>>>> Hey Guozhang,
> > >>>>>>>>
> > >>>>>>>> Thanks for the reply. Regarding JMX, I can dig it. I'll provide
> a
> > >>>>>> list in
> > >>>>>>>> the KIP. I was also thinking we'd better start a documentation
> > page
> > >>>>>> with
> > >>>>>>>> the metrics listed.
> > >>>>>>>>
> > >>>>>>>> I'd have no problem logging a warning when we skip records. On
> the
> > >>>>>> metric
> > >>>>>>>> front, really I'm just pushing for us to maintain only one level
> > of
> > >>>>>>>> metrics. If that's more or less granular (i.e., maybe we don't
> > >> have a
> > >>>>>>>> metric per reason and log the reason instead), that's fine by
> me.
> > I
> > >>>>>> just
> > >>>>>>>> don't think it provides a lot of extra value per complexity
> > >>> (interface
> > >>>>>>> and
> > >>>>>>>> implementation) to maintain roll-ups at the thread level in
> > >> addition
> > >>>>>> to
> > >>>>>>>> lower-level metrics.
> > >>>>>>>>
> > >>>>>>>> How about this instead:
> > >>>>>>>> * maintain one skipped-record metric (could be per-thread,
> > >> per-task,
> > >>>>>> or
> > >>>>>>>> per-processor-node) with no "reason"
> > >>>>>>>> * introduce a warn-level log detailing the
> topic/partition/offset
> > >> and
> > >>>>>>>> reason of the skipped record
> > >>>>>>>>
> > >>>>>>>> If you like that, I can update the KIP.
> > >>>>>>>>
> > >>>>>>>> Thanks,
> > >>>>>>>> -John
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> On Thu, Mar 29, 2018 at 6:22 PM, Guozhang Wang <
> > wangguoz@gmail.com
> > >>>
> > >>>>>>> wrote:
> > >>>>>>>>
> > >>>>>>>>>> One thing you mention is the notion of setting alerts on
> coarser
> > >>>>>>>> metrics
> > >>>>>>>>> being easier than finer ones. All the metric alerting systems I
> > >> have
> > >>>>>>> used
> > >>>>>>>>> make it equally easy to alert on metrics by-tag or over tags.
> So
> > >> my
> > >>>>>>>>> experience doesn't say that this is a use case. Were you
> thinking
> > >>>>>> of an
> > >>>>>>>>> alerting system that makes such a pre-aggregation valuable?
> > >>>>>>>>>
> > >>>>>>>>> For the commonly used JMX reporter tags will be encoded
> directly
> > >> as
> > >>>>>>> part
> > >>>>>>>> of
> > >>>>>>>>> the object name, and if users wants to monitor them they need
> to
> > >>>>>> know
> > >>>>>>>> these
> > >>>>>>>>> values before hand. That is also why I think we do want to list
> > >> all
> > >>>>>> the
> > >>>>>>>>> possible values of the reason tags in the KIP, since
> > >>>>>>>>>
> > >>>>>>>>>> In my email in response to Matthias, I gave an example of the
> > >>>>>> kind of
> > >>>>>>>>> scenario that would lead me as an operator to run with DEBUG on
> > >> all
> > >>>>>> the
> > >>>>>>>>> time, since I wouldn't be sure, having seen a skipped record
> > once,
> > >>>>>> that
> > >>>>>>>> it
> > >>>>>>>>> would ever happen again. The solution is to capture all the
> > >>>>>> available
> > >>>>>>>>> information about the reason and location of skips all the
> time.
> > >>>>>>>>>
> > >>>>>>>>> That is a good point. I think we can either expose all levels
> > >>>>>> metrics
> > >>>>>>> as
> > >>>>>>>> by
> > >>>>>>>>> default, or only expose the most lower-level metrics and get
> rid
> > >> of
> > >>>>>>> other
> > >>>>>>>>> levels to let users do roll-ups themselves (which will be a
> much
> > >>>>>> larger
> > >>>>>>>>> scope for discussion), or we can encourage users to not purely
> > >>>>>> depend
> > >>>>>>> on
> > >>>>>>>>> metrics for such trouble shooting: that is to say, users only
> be
> > >>>>>>> alerted
> > >>>>>>>>> based on metrics, and we can log a info / warn log4j entry each
> > >>>>>> time we
> > >>>>>>>> are
> > >>>>>>>>> about to skip a record all over the places, so that upon being
> > >>>>>> notified
> > >>>>>>>>> users can look into the logs to find the details on where /
> when
> > >> it
> > >>>>>>>>> happens. WDYT?
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> Guozhang
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> On Thu, Mar 29, 2018 at 3:57 PM, John Roesler <
> john@confluent.io
> > >
> > >>>>>>> wrote:
> > >>>>>>>>>
> > >>>>>>>>>> Hey Guozhang,
> > >>>>>>>>>>
> > >>>>>>>>>> Thanks for the review.
> > >>>>>>>>>>
> > >>>>>>>>>> 1.
> > >>>>>>>>>> Matthias raised the same question about the "reason" tag
> values.
> > >> I
> > >>>>>>> can
> > >>>>>>>>> list
> > >>>>>>>>>> all possible values of the "reason" tag, but I'm thinking this
> > >>>>>> level
> > >>>>>>> of
> > >>>>>>>>>> detail may not be KIP-worthy, maybe the code and documentation
> > >>>>>> review
> > >>>>>>>>> would
> > >>>>>>>>>> be sufficient. If you all disagree and would like it included
> in
> > >>>>>> the
> > >>>>>>>>> KIP, I
> > >>>>>>>>>> can certainly do that.
> > >>>>>>>>>>
> > >>>>>>>>>> If we do provide roll-up metrics, I agree with the pattern of
> > >>>>>> keeping
> > >>>>>>>> the
> > >>>>>>>>>> same name but eliminating the tags for the dimensions that
> were
> > >>>>>>>>> rolled-up.
> > >>>>>>>>>>
> > >>>>>>>>>> 2.
> > >>>>>>>>>> I'm not too sure that implementation efficiency really
> becomes a
> > >>>>>>> factor
> > >>>>>>>>> in
> > >>>>>>>>>> choosing whether to (by default) update one coarse metric at
> the
> > >>>>>>> thread
> > >>>>>>>>>> level or one granular metric at the processor-node level,
> since
> > >>>>>> it's
> > >>>>>>>> just
> > >>>>>>>>>> one metric being updated either way. I do agree that if we
> were
> > >> to
> > >>>>>>>> update
> > >>>>>>>>>> the granular metrics and multiple roll-ups, then we should
> > >>>>>> consider
> > >>>>>>> the
> > >>>>>>>>>> efficiency.
> > >>>>>>>>>>
> > >>>>>>>>>> I agree it's probably not necessary to surface the metrics for
> > >> all
> > >>>>>>>> nodes
> > >>>>>>>>>> regardless of whether they can or do skip records. Perhaps we
> > can
> > >>>>>>>> lazily
> > >>>>>>>>>> register the metrics.
> > >>>>>>>>>>
> > >>>>>>>>>> In my email in response to Matthias, I gave an example of the
> > >>>>>> kind of
> > >>>>>>>>>> scenario that would lead me as an operator to run with DEBUG
> on
> > >>>>>> all
> > >>>>>>> the
> > >>>>>>>>>> time, since I wouldn't be sure, having seen a skipped record
> > >> once,
> > >>>>>>> that
> > >>>>>>>>> it
> > >>>>>>>>>> would ever happen again. The solution is to capture all the
> > >>>>>> available
> > >>>>>>>>>> information about the reason and location of skips all the
> time.
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>> One thing you mention is the notion of setting alerts on
> coarser
> > >>>>>>>> metrics
> > >>>>>>>>>> being easier than finer ones. All the metric alerting systems
> I
> > >>>>>> have
> > >>>>>>>> used
> > >>>>>>>>>> make it equally easy to alert on metrics by-tag or over tags.
> So
> > >>>>>> my
> > >>>>>>>>>> experience doesn't say that this is a use case. Were you
> > thinking
> > >>>>>> of
> > >>>>>>> an
> > >>>>>>>>>> alerting system that makes such a pre-aggregation valuable?
> > >>>>>>>>>>
> > >>>>>>>>>> Thanks again,
> > >>>>>>>>>> -John
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>> On Thu, Mar 29, 2018 at 5:24 PM, Guozhang Wang <
> > >>>>>> wangguoz@gmail.com>
> > >>>>>>>>> wrote:
> > >>>>>>>>>>
> > >>>>>>>>>>> Hello John,
> > >>>>>>>>>>>
> > >>>>>>>>>>> Thanks for the KIP. Some comments:
> > >>>>>>>>>>>
> > >>>>>>>>>>> 1. Could you list all the possible values of the "reason"
> tag?
> > >>>>>> In
> > >>>>>>> the
> > >>>>>>>>>> JIRA
> > >>>>>>>>>>> ticket I left some potential reasons but I'm not clear if
> > you're
> > >>>>>>>> going
> > >>>>>>>>> to
> > >>>>>>>>>>> categorize each of them as a separate reason, or is there any
> > >>>>>>>>> additional
> > >>>>>>>>>>> ones you have in mind.
> > >>>>>>>>>>>
> > >>>>>>>>>>> Also I'm wondering if we should add another metric that do
> not
> > >>>>>> have
> > >>>>>>>> the
> > >>>>>>>>>>> reason tag but aggregates among all possible reasons? This is
> > >>>>>> for
> > >>>>>>>> users
> > >>>>>>>>>> to
> > >>>>>>>>>>> easily set their alerting notifications (otherwise they have
> to
> > >>>>>>> write
> > >>>>>>>>> on
> > >>>>>>>>>>> notification rule per reason) in their monitoring systems.
> > >>>>>>>>>>>
> > >>>>>>>>>>> 2. Note that the processor-node metrics is actually
> > "per-thread,
> > >>>>>>>>>> per-task,
> > >>>>>>>>>>> per-processor-node", and today we only set the per-thread
> > >>>>>> metrics
> > >>>>>>> as
> > >>>>>>>>> INFO
> > >>>>>>>>>>> while leaving the lower two layers as DEBUG. I agree with
> your
> > >>>>>>>> argument
> > >>>>>>>>>>> that we are missing the per-client roll-up metrics today, but
> > >>>>>> I'm
> > >>>>>>>>>> convinced
> > >>>>>>>>>>> that the right way to approach it would be
> > >>>>>>>> "just-providing-the-lowest-
> > >>>>>>>>>>> level
> > >>>>>>>>>>> metrics only".
> > >>>>>>>>>>>
> > >>>>>>>>>>> Note the recoding implementation of these three levels are
> > >>>>>>> different
> > >>>>>>>>>>> internally today: we did not just do the rolling up to
> generate
> > >>>>>> the
> > >>>>>>>>>>> higher-level metrics from the lower level ones, but we just
> > >>>>>> record
> > >>>>>>>> them
> > >>>>>>>>>>> separately, which means that, if we turn on multiple levels
> of
> > >>>>>>>> metrics,
> > >>>>>>>>>> we
> > >>>>>>>>>>> maybe duplicate collecting some metrics. One can argue that
> is
> > >>>>>> not
> > >>>>>>>> the
> > >>>>>>>>>> best
> > >>>>>>>>>>> way to represent multi-level metrics collecting and
> reporting,
> > >>>>>> but
> > >>>>>>> by
> > >>>>>>>>>> only
> > >>>>>>>>>>> enabling thread-level metrics as INFO today, that
> > implementation
> > >>>>>>>> could
> > >>>>>>>>> be
> > >>>>>>>>>>> more efficient than only collecting the metrics at the lowest
> > >>>>>>> level,
> > >>>>>>>>> and
> > >>>>>>>>>>> then do the roll-up calculations outside of the metrics
> > classes.
> > >>>>>>>>>>>
> > >>>>>>>>>>> Plus, today not all processor-nodes may possibly skip
> records,
> > >>>>>>> AFAIK
> > >>>>>>>> we
> > >>>>>>>>>>> will only skip records at the source, sink, window and
> > >>>>>>>>>>> aggregation processor nodes, so adding a metric per
> > >>>>>>>>>>> processor looks like overkill to me as well. On the other
> > >>>>>>>>>>> hand, from the user's perspective the "reason" tag may be
> > >>>>>>>>>>> sufficient to narrow down where inside the topology records
> > >>>>>>>>>>> are being dropped on the floor. So I think the "per-thread,
> > >>>>>>>>>>> per-task" level metrics should be sufficient for them to
> > >>>>>>>>>>> troubleshoot in DEBUG mode, and we can add another
> > >>>>>>>>>>> "per-thread" level metric as INFO which is turned on by
> > >>>>>>>>>>> default. So under normal execution users still only need
> > >>>>>>>>>>> INFO level metrics for alerting (e.g. set alerts on all
> > >>>>>>>>>>> skipped-records metrics as non-zero), and then upon
> > >>>>>>>>>>> troubleshooting they can turn on DEBUG metrics to look into
> > >>>>>>>>>>> which task is actually causing the skipped records.
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>> Guozhang
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>> On Thu, Mar 29, 2018 at 2:03 PM, Matthias J. Sax <
> > >>>>>>>>> matthias@confluent.io>
> > >>>>>>>>>>> wrote:
> > >>>>>>>>>>>
> > >>>>>>>>>>>> Thanks for the KIP John.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Reading the material on the related Jiras, I am wondering
> what
> > >>>>>>>>> `reason`
> > >>>>>>>>>>>> tags you want to introduce? Can you elaborate? The KIP
> should
> > >>>>>>> list
> > >>>>>>>>>> those
> > >>>>>>>>>>>> IMHO.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> About the fine grained metrics vs the roll-up: you say that
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>> the coarse metric aggregates across two dimensions
> > >>>>>>> simultaneously
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Can you elaborate why this is an issue? I am not convinced
> atm
> > >>>>>>> that
> > >>>>>>>>> we
> > >>>>>>>>>>>> should put the fine grained metrics into INFO level and
> remove
> > >>>>>>> the
> > >>>>>>>>>>>> roll-up at thread level.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>> Given that they have to do this sum to get a usable
> > >>>>>> top-level
> > >>>>>>>> view
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> This is a fair concern, but I don't share the conclusion.
> > >>>>>>> Offering
> > >>>>>>>> a
> > >>>>>>>>>>>> built-in `KafkaStreams` "client" roll-up out of the box
> might
> > >>>>>> be
> > >>>>>>> a
> > >>>>>>>>>>>> better solution. In the past we did not offer this due to
> > >>>>>>>> performance
> > >>>>>>>>>>>> concerns, but we could allow an "opt-in" mechanism. If you
> > >>>>>>>> disagree,
> > >>>>>>>>>> can
> > >>>>>>>>>>>> you provide some reasoning and add them to the "Rejected
> > >>>>>>>>> alternatives"
> > >>>>>>>>>>>> section.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> To rephrase: I understand the issue about missing top-level
> > >>>>>> view,
> > >>>>>>>> but
> > >>>>>>>>>>>> instead of going more fine grained, we should consider to
> add
> > >>>>>>> this
> > >>>>>>>>>>>> top-level view and add/keep the fine grained metrics at
> DEBUG
> > >>>>>>> level
> > >>>>>>>>>> only
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> I am +1 to add TopologyTestDriver#metrics() and to remove
> old
> > >>>>>>>> metrics
> > >>>>>>>>>>>> directly as you suggested.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> -Matthias
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> On 3/28/18 6:42 PM, Ted Yu wrote:
> > >>>>>>>>>>>>> Looks good to me.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> On Wed, Mar 28, 2018 at 3:11 PM, John Roesler <
> > >>>>>>> john@confluent.io
> > >>>>>>>>>
> > >>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Hello all,
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> I am proposing KIP-274 to improve the metrics around
> > >>>>>> skipped
> > >>>>>>>>> records
> > >>>>>>>>>>> in
> > >>>>>>>>>>>>>> Streams.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Please find the details here:
> > >>>>>>>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> > >>>>>>>>>>>>>> 274%3A+Kafka+Streams+Skipped+Records+Metrics
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Please let me know what you think!
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Thanks,
> > >>>>>>>>>>>>>> -John
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>> --
> > >>>>>>>>>>> -- Guozhang
> > >>>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> --
> > >>>>>>>>> -- Guozhang
> > >>>>>>>>>
> > >>>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> --
> > >>>>>>> -- Guozhang
> > >>>>>>>
> > >>>>>>
> > >>>>>
> > >>>>>
> > >>>>
> > >>>
> > >>>
> > >>
> > >
> > >
> > >
> >
> >
>



-- 
-- Guozhang

Re: [DISCUSS] KIP-274: Kafka Streams Skipped Records Metrics

Posted by John Roesler <jo...@confluent.io>.
I agree we should add as much information as is reasonable to the log. For
example, see this WIP PR I started for this KIP:

https://github.com/apache/kafka/pull/4812/files#diff-88d129f048bc842c7db5b2566a45fce8R80

and

https://github.com/apache/kafka/pull/4812/files#diff-69e6789eb675ec978a1abd24fed96eb1R111

I'm not sure if we should nail down the log messages in the KIP or in the
PR discussion. What say you?
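As an aside on the roll-up question discussed below: if only per-task metrics are kept, the client-level total becomes a roll-up users compute themselves from the metrics map. A minimal sketch of that roll-up -- the string metric names here are illustrative stand-ins for the `MetricName` keys a `metrics()` call would actually return:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: summing per-task "skipped-records-total" values
// into a single client-level total, i.e. the roll-up a user could do over
// the map returned by KafkaStreams#metrics() or TopologyTestDriver#metrics().
public class SkipRollupSketch {

    static double totalSkipped(Map<String, Double> metrics) {
        return metrics.entrySet().stream()
            .filter(e -> e.getKey().endsWith("skipped-records-total"))
            .mapToDouble(Map.Entry::getValue)
            .sum();
    }

    public static void main(String[] args) {
        Map<String, Double> metrics = new LinkedHashMap<>();
        metrics.put("task-0_0.skipped-records-total", 3.0);
        metrics.put("task-0_1.skipped-records-total", 2.0);
        metrics.put("task-0_0.process-rate", 100.0); // unrelated metric, ignored
        System.out.println(totalSkipped(metrics)); // prints 5.0
    }
}
```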

Thanks,
-John

On Tue, Apr 3, 2018 at 12:20 AM, Matthias J. Sax <ma...@confluent.io>
wrote:

> Thanks for sharing your thoughts. As I mentioned originally, I am not
> sure about the right log level either. Your arguments are convincing --
> thus, I am fine with keeping WARN level.
>
> The task vs thread level argument is an interesting one. However, I am
> wondering if we should add this information into the corresponding WARN
> logs that we write anyway? For this case, we can also log the
> corresponding operator (and other information like topic name etc if
> needed). WDYT about this?
>
>
> -Matthias
>
> On 4/2/18 8:31 PM, Guozhang Wang wrote:
> > Regarding logging: I'm inclined to keep logging at WARN level since
> > skipped records are not expected in normal execution (for all reasons
> > that we are aware of); hence, when errors happen, users should be
> > alerted by the metrics and look into the log files. So to me, if it is
> > really spamming the log files, that is also a good alert for users.
> > Besides, for deserialization errors we already log at WARN level for
> > this reason.
> >
> > Regarding the metrics levels: I was pondering on that as well. What
> > made me think and agree on task-level rather than thread-level is that
> > for some reasons, like window retention, skips may be happening on only
> > a subset of input partitions, and since tasks are correlated with
> > partitions, the task-level metrics can help users narrow down the
> > specific input data partitions.
> >
> >
> > Guozhang
> >
> >
> > On Mon, Apr 2, 2018 at 6:43 PM, John Roesler <jo...@confluent.io> wrote:
> >
> >> Hi Matthias,
> >>
> >> No worries! Thanks for the reply.
> >>
> >> 1) There isn't a connection. I tried using the TopologyTestDriver to
> write
> >> a quick test exercising the current behavior and discovered that the
> >> metrics weren't available. It seemed like they should be, so I tacked
> it on
> >> to this KIP. If you feel it's inappropriate, I can pull it back out.
> >>
> >> 2) I was also concerned about that, but I figured it would come up in
> >> discussion if I just went ahead and proposed it. And here we are!
> >>
> >> Here's my thought: maybe there are two classes of skips: "controlled"
> and
> >> "uncontrolled", where "controlled" means, as an app author, I
> deliberately
> >> filter out some events, and "uncontrolled" means that I simply don't
> >> account for some feature of the data, and the framework skips them (as
> >> opposed to crashing).
> >>
> >> In this breakdown, the skips I'm adding metrics for are all
> uncontrolled
> >> skips (and we hope to measure all the uncontrolled skips). Our skips are
> >> well documented, so it wouldn't be terrible to have an application in
> which
> >> you know you expect to have tons of uncontrolled skips, but it's not
> great
> >> either, since you may also have some *unexpected* uncontrolled skips.
> It'll
> >> be difficult to notice, since you're probably not alerting on the metric
> >> and filtering out the logs (whatever their level).
> >>
> >> I'd recommend any app author, as an alternative, to convert all expected
> >> skips to controlled ones, by updating the topology to filter those
> records
> >> out.
> >>
> >> Following from my recommendation, as a library author, I'm inclined to
> mark
> >> those logs WARN, since in my opinion, they should be concerning to the
> app
> >> authors. I'd definitely want to show, rather than hide, them by
> default, so
> >> I would pick INFO at least.
> >>
> >> That said, logging is always a tricky issue for lower-level libraries
> that
> >> run inside user code, since we don't have all the information we need to
> >> make the right call.
> >>
> >>
> >>
> >> On your last note, yeah, I got that impression from Guozhang as well.
> >> Thanks for the clarification.
> >>
> >> -John
> >>
> >>
> >>
> >> On Mon, Apr 2, 2018 at 4:03 PM, Matthias J. Sax <ma...@confluent.io>
> >> wrote:
> >>
> >>> John,
> >>>
> >>> sorry for my late reply and thanks for updating the KIP.
> >>>
> >>> I like your approach about "metrics are for monitoring, logs are for
> >>> debugging" -- however:
> >>>
> >>> 1) I don't see a connection between this and the task-level metrics
> that
> >>> you propose to get the metrics in `TopologyTestDriver`. I don't think
> >> people would monitor the `TopologyTestDriver`, and thus I am wondering why
> >> it is important to include the metrics there. Thread-level metrics might be
> >> easier to monitor, though (i.e., fewer different metrics to monitor).
> >>>
> >>> 2) I am a little worried about WARN level logging and that it might be
> >>> too chatty -- as you pointed out, it's about debugging, thus DEBUG
> level
> >>> might be better. Not 100% sure about this to be honest. What is the
> >>> general assumption about the frequency for skipped records? I could
> >>> imagine cases for which skipped records are quite frequent and thus,
> >>> WARN level logs might "flood" the logs
> >>>
> >>> One final remark:
> >>>
> >>>> More
> >>>> generally, I would like to establish a pattern in which we could add
> >> new
> >>>> values for the "reason" tags without needing a KIP to do so.
> >>>
> >>> From my understanding, this is not feasible. Changing metrics is always
> >>> considered a public API change, and we need a KIP for any change. As we
> >>> moved away from tagging, it doesn't matter for the KIP anymore -- just
> >>> wanted to point it out.
> >>>
> >>>
> >>> -Matthias
> >>>
> >>>
> >>> On 3/30/18 2:47 PM, John Roesler wrote:
> >>>> Allrighty! The KIP is updated.
> >>>>
> >>>> Thanks again, all, for the feedback.
> >>>> -John
> >>>>
> >>>> On Fri, Mar 30, 2018 at 3:35 PM, John Roesler <jo...@confluent.io>
> >> wrote:
> >>>>
> >>>>> Hey Guozhang and Bill,
> >>>>>
> >>>>> Ok, I'll update the KIP. At the risk of disturbing consensus, I'd
> like
> >>> to
> >>>>> put it in the task instead of the thread so that it'll show up in the
> >>>>> TopologyTestDriver metrics as well.
> >>>>>
> >>>>> I'm leaning toward keeping the scope where it is right now, but if
> >>> others
> >>>>> want to advocate for tossing in some more metrics, we can go that
> >> route.
> >>>>>
> >>>>> Thanks all,
> >>>>> -John
> >>>>>
> >>>>> On Fri, Mar 30, 2018 at 2:37 PM, Bill Bejeck <bb...@gmail.com>
> >> wrote:
> >>>>>
> >>>>>> Thanks for the KIP John, and sorry for the late comments.
> >>>>>>
> >>>>>> I'm on the fence with providing a single level of metrics, but I think
> >>> we'll
> >>>>>> have that discussion outside of this KIP.
> >>>>>>
> >>>>>>> * maintain one skipped-record metric (could be per-thread,
> per-task,
> >>> or
> >>>>>>> per-processor-node) with no "reason"
> >>>>>>> * introduce a warn-level log detailing the topic/partition/offset
> >> and
> >>>>>>> reason of the skipped record
> >>>>>>
> >>>>>> I'm +1 on both of these suggestions.
> >>>>>>
> >>>>>> Finally, we have had requests in the past for some metrics around
> >> when
> >>>>>> persistent store removes an expired window.  Would adding that to
> our
> >>>>>> metrics stretch the scope of this KIP too much?
> >>>>>>
> >>>>>> Thanks again and overall I'm +1 on this KIP
> >>>>>>
> >>>>>> Bill
> >>>>>>
> >>>>>> On Fri, Mar 30, 2018 at 2:00 PM, Guozhang Wang <wa...@gmail.com>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> The proposal sounds good to me. About "maintain only one level of
> >>>>>> metrics"
> >>>>>>> maybe we can discuss that separately from this KIP since that
> >>>>>> would
> >>>>>>> be a larger scope of discussion. I agree that if we are going to
> >>>>>> maintain
> >>>>>>> only one-level metrics it should be lowest level and we would let
> >>> users
> >>>>>> to
> >>>>>>> do the roll-ups themselves, but I'm still not fully convinced that
> >> we
> >>>>>>> should just provide single-level metrics, because 1) I think for
> >>>>>> different
> >>>>>>> metrics people may be interested in investigating different
> >>>>>>> granularities, e.g. for poll / commit rate these are at the lowest
> >>>>>>> task-level metrics, while for process-rate / skip-rate they can be
> >> as
> >>>>>> low
> >>>>>>> as processor-node metrics, and 2) user-side rolling ups may not be
> >>> very
> >>>>>>> straight-forward. But for 2) if someone can provide an efficient
> and
> >>>>>> easy
> >>>>>>> implementation of that I can be persuaded :)
> >>>>>>>
> >>>>>>> For now I'm thinking we can add the metric on thread-level, either
> >>> with
> >>>>>>> finer grained ones with "reason" tag plus an aggregated one without
> >>> the
> >>>>>>> tag, or just having a single aggregated metric without the tag
> looks
> >>>>>> good
> >>>>>>> to me.
> >>>>>>>
> >>>>>>>
> >>>>>>> Guozhang
> >>>>>>>
> >>>>>>> On Fri, Mar 30, 2018 at 8:05 AM, John Roesler <jo...@confluent.io>
> >>>>>> wrote:
> >>>>>>>
> >>>>>>>> Hey Guozhang,
> >>>>>>>>
> >>>>>>>> Thanks for the reply. Regarding JMX, I can dig it. I'll provide a
> >>>>>> list in
> >>>>>>>> the KIP. I was also thinking we'd better start a documentation
> page
> >>>>>> with
> >>>>>>>> the metrics listed.
> >>>>>>>>
> >>>>>>>> I'd have no problem logging a warning when we skip records. On the
> >>>>>> metric
> >>>>>>>> front, really I'm just pushing for us to maintain only one level
> of
> >>>>>>>> metrics. If that's more or less granular (i.e., maybe we don't
> >> have a
> >>>>>>>> metric per reason and log the reason instead), that's fine by me.
> I
> >>>>>> just
> >>>>>>>> don't think it provides a lot of extra value per complexity
> >>> (interface
> >>>>>>> and
> >>>>>>>> implementation) to maintain roll-ups at the thread level in
> >> addition
> >>>>>> to
> >>>>>>>> lower-level metrics.
> >>>>>>>>
> >>>>>>>> How about this instead:
> >>>>>>>> * maintain one skipped-record metric (could be per-thread,
> >> per-task,
> >>>>>> or
> >>>>>>>> per-processor-node) with no "reason"
> >>>>>>>> * introduce a warn-level log detailing the topic/partition/offset
> >> and
> >>>>>>>> reason of the skipped record
> >>>>>>>>
> >>>>>>>> If you like that, I can update the KIP.
> >>>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>> -John
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Thu, Mar 29, 2018 at 6:22 PM, Guozhang Wang <
> wangguoz@gmail.com
> >>>
> >>>>>>> wrote:
> >>>>>>>>
> >>>>>>>>>> One thing you mention is the notion of setting alerts on coarser
> >>>>>>>> metrics
> >>>>>>>>> being easier than finer ones. All the metric alerting systems I
> >> have
> >>>>>>> used
> >>>>>>>>> make it equally easy to alert on metrics by-tag or over tags. So
> >> my
> >>>>>>>>> experience doesn't say that this is a use case. Were you thinking
> >>>>>> of an
> >>>>>>>>> alerting system that makes such a pre-aggregation valuable?
> >>>>>>>>>
> >>>>>>>>> For the commonly used JMX reporter, tags will be encoded directly
> >> as
> >>>>>>> part
> >>>>>>>> of
> >>>>>>>>> the object name, and if users want to monitor them they need to know
> >>>>>>>>> these values beforehand. That is also why I think we do want to list
> >> all
> >>>>>> the
> >>>>>>>>> possible values of the reason tags in the KIP, since
> >>>>>>>>>
> >>>>>>>>>> In my email in response to Matthias, I gave an example of the
> >>>>>> kind of
> >>>>>>>>> scenario that would lead me as an operator to run with DEBUG on
> >> all
> >>>>>> the
> >>>>>>>>> time, since I wouldn't be sure, having seen a skipped record
> once,
> >>>>>> that
> >>>>>>>> it
> >>>>>>>>> would ever happen again. The solution is to capture all the
> >>>>>> available
> >>>>>>>>> information about the reason and location of skips all the time.
> >>>>>>>>>
> >>>>>>>>> That is a good point. I think we can either expose all levels
> >>>>>> metrics
> >>>>>>> as
> >>>>>>>> by
> >>>>>>>>> default, or only expose the most lower-level metrics and get rid
> >> of
> >>>>>>> other
> >>>>>>>>> levels to let users do roll-ups themselves (which will be a much
> >>>>>> larger
> >>>>>>>>> scope for discussion), or we can encourage users to not purely
> >>>>>> depend
> >>>>>>> on
> >>>>>>>>> metrics for such troubleshooting: that is to say, users would only be
> >>>>>>>>> alerted based on metrics, and we can log an info / warn log4j entry each
> >>>>>> time we
> >>>>>>>> are
> >>>>>>>>> about to skip a record all over the places, so that upon being
> >>>>>> notified
> >>>>>>>>> users can look into the logs to find the details on where / when
> >> it
> >>>>>>>>> happens. WDYT?
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Guozhang
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On Thu, Mar 29, 2018 at 3:57 PM, John Roesler <john@confluent.io
> >
> >>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>>> Hey Guozhang,
> >>>>>>>>>>
> >>>>>>>>>> Thanks for the review.
> >>>>>>>>>>
> >>>>>>>>>> 1.
> >>>>>>>>>> Matthias raised the same question about the "reason" tag values.
> >> I
> >>>>>>> can
> >>>>>>>>> list
> >>>>>>>>>> all possible values of the "reason" tag, but I'm thinking this
> >>>>>> level
> >>>>>>> of
> >>>>>>>>>> detail may not be KIP-worthy, maybe the code and documentation
> >>>>>> review
> >>>>>>>>> would
> >>>>>>>>>> be sufficient. If you all disagree and would like it included in
> >>>>>> the
> >>>>>>>>> KIP, I
> >>>>>>>>>> can certainly do that.
> >>>>>>>>>>
> >>>>>>>>>> If we do provide roll-up metrics, I agree with the pattern of
> >>>>>> keeping
> >>>>>>>> the
> >>>>>>>>>> same name but eliminating the tags for the dimensions that were
> >>>>>>>>> rolled-up.
> >>>>>>>>>>
> >>>>>>>>>> 2.
> >>>>>>>>>> I'm not too sure that implementation efficiency really becomes a
> >>>>>>> factor
> >>>>>>>>> in
> >>>>>>>>>> choosing whether to (by default) update one coarse metric at the
> >>>>>>> thread
> >>>>>>>>>> level or one granular metric at the processor-node level, since
> >>>>>> it's
> >>>>>>>> just
> >>>>>>>>>> one metric being updated either way. I do agree that if we were
> >> to
> >>>>>>>> update
> >>>>>>>>>> the granular metrics and multiple roll-ups, then we should
> >>>>>> consider
> >>>>>>> the
> >>>>>>>>>> efficiency.
> >>>>>>>>>>
> >>>>>>>>>> I agree it's probably not necessary to surface the metrics for
> >> all
> >>>>>>>> nodes
> >>>>>>>>>> regardless of whether they can or do skip records. Perhaps we
> can
> >>>>>>>> lazily
> >>>>>>>>>> register the metrics.
> >>>>>>>>>>
> >>>>>>>>>> In my email in response to Matthias, I gave an example of the
> >>>>>> kind of
> >>>>>>>>>> scenario that would lead me as an operator to run with DEBUG on
> >>>>>> all
> >>>>>>> the
> >>>>>>>>>> time, since I wouldn't be sure, having seen a skipped record
> >> once,
> >>>>>>> that
> >>>>>>>>> it
> >>>>>>>>>> would ever happen again. The solution is to capture all the
> >>>>>> available
> >>>>>>>>>> information about the reason and location of skips all the time.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> One thing you mention is the notion of setting alerts on coarser
> >>>>>>>> metrics
> >>>>>>>>>> being easier than finer ones. All the metric alerting systems I
> >>>>>> have
> >>>>>>>> used
> >>>>>>>>>> make it equally easy to alert on metrics by-tag or over tags. So
> >>>>>> my
> >>>>>>>>>> experience doesn't say that this is a use case. Were you
> thinking
> >>>>>> of
> >>>>>>> an
> >>>>>>>>>> alerting system that makes such a pre-aggregation valuable?
> >>>>>>>>>>
> >>>>>>>>>> Thanks again,
> >>>>>>>>>> -John
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On Thu, Mar 29, 2018 at 5:24 PM, Guozhang Wang <
> >>>>>> wangguoz@gmail.com>
> >>>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Hello John,
> >>>>>>>>>>>
> >>>>>>>>>>> Thanks for the KIP. Some comments:
> >>>>>>>>>>>
> >>>>>>>>>>> 1. Could you list all the possible values of the "reason" tag?
> >>>>>> In
> >>>>>>> the
> >>>>>>>>>> JIRA
> >>>>>>>>>>> ticket I left some potential reasons but I'm not clear if
> you're
> >>>>>>>> going
> >>>>>>>>> to
> >>>>>>>>>>> categorize each of them as a separate reason, or is there any
> >>>>>>>>> additional
> >>>>>>>>>>> ones you have in mind.
> >>>>>>>>>>>
> >>>>>>>>>>> Also I'm wondering if we should add another metric that does not
> >>>>>> have
> >>>>>>>> the
> >>>>>>>>>>> reason tag but aggregates among all possible reasons? This is
> >>>>>> for
> >>>>>>>> users
> >>>>>>>>>> to
> >>>>>>>>>>> easily set their alerting notifications (otherwise they have to
> >>>>>>>>>>> write one notification rule per reason) in their monitoring systems.
> >>>>>>>>>>>
> >>>>>>>>>>> 2. Note that the processor-node metrics is actually
> "per-thread,
> >>>>>>>>>> per-task,
> >>>>>>>>>>> per-processor-node", and today we only set the per-thread
> >>>>>> metrics
> >>>>>>> as
> >>>>>>>>> INFO
> >>>>>>>>>>> while leaving the lower two layers as DEBUG. I agree with your
> >>>>>>>> argument
> >>>>>>>>>>> that we are missing the per-client roll-up metrics today, but
> >>>>>> I'm
> >>>>>>>>>> not convinced
> >>>>>>>>>>> that the right way to approach it would be
> >>>>>>>> "just-providing-the-lowest-
> >>>>>>>>>>> level
> >>>>>>>>>>> metrics only".
> >>>>>>>>>>>
> >>>>>>>>>>> Note that the recording implementations of these three levels are
> >>>>>>> different
> >>>>>>>>>>> internally today: we did not just do the rolling up to generate
> >>>>>> the
> >>>>>>>>>>> higher-level metrics from the lower level ones, but we just
> >>>>>> record
> >>>>>>>> them
> >>>>>>>>>>> separately, which means that, if we turn on multiple levels of
> >>>>>>>> metrics,
> >>>>>>>>>> we
> >>>>>>>>>>> maybe duplicate collecting some metrics. One can argue that is
> >>>>>> not
> >>>>>>>> the
> >>>>>>>>>> best
> >>>>>>>>>>> way to represent multi-level metrics collecting and reporting,
> >>>>>> but
> >>>>>>> by
> >>>>>>>>>> only
> >>>>>>>>>>> enabling thread-level metrics as INFO today, that
> implementation
> >>>>>>>> could
> >>>>>>>>> be
> >>>>>>>>>>> more efficient than only collecting the metrics at the lowest
> >>>>>>> level,
> >>>>>>>>> and
> >>>>>>>>>>> then do the roll-up calculations outside of the metrics
> classes.
> >>>>>>>>>>>
> >>>>>>>>>>> Plus, today not all processor-nodes may possibly skip records,
> >>>>>>> AFAIK
> >>>>>>>> we
> >>>>>>>>>>> will only skip records at the source, sink, window and
> >>>>>> aggregation
> >>>>>>>>>>> processor nodes, so adding a metric per processor looks like an
> >>>>>>>>> overkill
> >>>>>>>>>> to
> >>>>>>>>>>> me as well. On the other hand, from user's perspective the
> >>>>>> "reason"
> >>>>>>>> tag
> >>>>>>>>>> may
> >>>>>>>>>>> be sufficient for them to narrow down where inside the topology
> >>>>>> is
> >>>>>>>>>> causing
> >>>>>>>>>>> records to be dropped on the floor. So I think the "per-thread,
> >>>>>>>>> per-task"
> >>>>>>>>>>> level metrics should be sufficient for them to troubleshoot in
> >>>>>>> DEBUG
> >>>>>>>>>> mode,
> >>>>>>>>>>> and we can add another "per-thread" level metrics as INFO which
> >>>>>> is
> >>>>>>>>> turned
> >>>>>>>>>>> on by default. So under normal execution users still only need
> >>>>>> INFO
> >>>>>>>>> level
> >>>>>>>>>>> metrics for alerting (e.g. set alerts on all skipped-records
> >>>>>>> metrics
> >>>>>>>> as
> >>>>>>>>>>> non-zero), and then upon trouble shooting they can turn on
> DEBUG
> >>>>>>>>> metrics
> >>>>>>>>>> to
> >>>>>>>>>>> look into which task is actually causing the skipped records.
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> Guozhang
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> On Thu, Mar 29, 2018 at 2:03 PM, Matthias J. Sax <
> >>>>>>>>> matthias@confluent.io>
> >>>>>>>>>>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> Thanks for the KIP John.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Reading the material on the related Jiras, I am wondering what
> >>>>>>>>> `reason`
> >>>>>>>>>>>> tags you want to introduce? Can you elaborate? The KIP should
> >>>>>>> list
> >>>>>>>>>> those
> >>>>>>>>>>>> IMHO.
> >>>>>>>>>>>>
> >>>>>>>>>>>> About the fine grained metrics vs the roll-up: you say that
> >>>>>>>>>>>>
> >>>>>>>>>>>>> the coarse metric aggregates across two dimensions
> >>>>>>> simultaneously
> >>>>>>>>>>>>
> >>>>>>>>>>>> Can you elaborate why this is an issue? I am not convinced atm
> >>>>>>> that
> >>>>>>>>> we
> >>>>>>>>>>>> should put the fine grained metrics into INFO level and remove
> >>>>>>> the
> >>>>>>>>>>>> roll-up at thread level.
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Given that they have to do this sum to get a usable
> >>>>>> top-level
> >>>>>>>> view
> >>>>>>>>>>>>
> >>>>>>>>>>>> This is a fair concern, but I don't share the conclusion.
> >>>>>>> Offering
> >>>>>>>> a
> >>>>>>>>>>>> built-in `KafkaStreams` "client" roll-up out of the box might
> >>>>>> be
> >>>>>>> a
> >>>>>>>>>>>> better solution. In the past we did not offer this due to
> >>>>>>>> performance
> >>>>>>>>>>>> concerns, but we could allow an "opt-in" mechanism. If you
> >>>>>>>> disagree,
> >>>>>>>>>> can
> >>>>>>>>>>>> you provide some reasoning and add them to the "Rejected
> >>>>>>>>> alternatives"
> >>>>>>>>>>>> section.
> >>>>>>>>>>>>
> >>>>>>>>>>>> To rephrase: I understand the issue about missing top-level
> >>>>>> view,
> >>>>>>>> but
> >>>>>>>>>>>> instead of going more fine grained, we should consider to add
> >>>>>>> this
> >>>>>>>>>>>> top-level view and add/keep the fine grained metrics at DEBUG
> >>>>>>> level
> >>>>>>>>>> only
> >>>>>>>>>>>>
> >>>>>>>>>>>> I am +1 to add TopologyTestDriver#metrics() and to remove old
> >>>>>>>> metrics
> >>>>>>>>>>>> directly as you suggested.
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> -Matthias
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> On 3/28/18 6:42 PM, Ted Yu wrote:
> >>>>>>>>>>>>> Looks good to me.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Wed, Mar 28, 2018 at 3:11 PM, John Roesler <
> >>>>>>> john@confluent.io
> >>>>>>>>>
> >>>>>>>>>>> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> Hello all,
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> I am proposing KIP-274 to improve the metrics around
> >>>>>> skipped
> >>>>>>>>> records
> >>>>>>>>>>> in
> >>>>>>>>>>>>>> Streams.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Please find the details here:
> >>>>>>>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> >>>>>>>>>>>>>> 274%3A+Kafka+Streams+Skipped+Records+Metrics
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Please let me know what you think!
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>> -John
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> --
> >>>>>>>>>>> -- Guozhang
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> --
> >>>>>>>>> -- Guozhang
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> --
> >>>>>>> -- Guozhang
> >>>>>>>

Re: [DISCUSS] KIP-274: Kafka Streams Skipped Records Metrics

Posted by "Matthias J. Sax" <ma...@confluent.io>.
Thanks for sharing your thoughts. As I mentioned originally, I am not
sure about the right log level either. Your arguments are convincing --
thus, I am fine with keeping WARN level.

The task vs. thread level argument is an interesting one. However, I am
wondering if we should instead add this information to the corresponding
WARN logs that we write anyway. In that case, we could also log the
corresponding operator (and other information, like the topic name, if
needed). WDYT about this?
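To make that concrete, here is a rough sketch of what such a WARN message could carry. The `skipMessage` helper, the field names, and the exact wording below are all illustrative assumptions on my part, not the actual Streams implementation:

```java
// Sketch only: one possible shape for a WARN-level skipped-record message
// that carries the task id, processor node, and source position alongside
// the skip reason. All names and formats here are assumptions.
public class SkippedRecordLogSketch {

    // Build the message we would hand to the logger at WARN level.
    static String skipMessage(String taskId, String processorNode,
                              String topic, int partition, long offset,
                              String reason) {
        return String.format(
            "Skipping record due to %s. task=[%s] processor=[%s] "
                + "topic=[%s] partition=[%d] offset=[%d]",
            reason, taskId, processorNode, topic, partition, offset);
    }

    public static void main(String[] args) {
        // Example: a record dropped because its window has already expired.
        System.out.println(skipMessage(
            "0_1", "KSTREAM-AGGREGATE-0000000002",
            "input-topic", 3, 12345L, "expired window"));
    }
}
```

With the task and processor in the message, an operator who is alerted by the aggregated skip metric can grep the logs to find exactly which operator and which input partition produced the skips, without needing per-node metrics.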


-Matthias

On 4/2/18 8:31 PM, Guozhang Wang wrote:
> Regarding logging: I'm inclined to keep logging at WARN level since skipped
> records are not expected in normal execution (for all reasons that we are
> aware of), and hence when errors happen users should be alerted by the
> metrics and then look into the log files; so to me, if it is really spamming
> the log files, that is itself a good alert for users. Besides, for
> deserialization errors we already log at WARN level for this reason.
> 
> Regarding the metrics levels: I was pondering on that as well. What made me
> think and agree on task-level rather than thread-level is that some reasons,
> like window retention, may possibly be happening on only a subset of input
> partitions; since tasks are correlated with partitions, the task-level metrics
> can help users narrow down on the specific input data partitions.
> 
> 
> Guozhang
> 
> 
> On Mon, Apr 2, 2018 at 6:43 PM, John Roesler <jo...@confluent.io> wrote:
> 
>> Hi Matthias,
>>
>> No worries! Thanks for the reply.
>>
>> 1) There isn't a connection. I tried using the TopologyTestDriver to write
>> a quick test exercising the current behavior and discovered that the
>> metrics weren't available. It seemed like they should be, so I tacked it on
>> to this KIP. If you feel it's inappropriate, I can pull it back out.
>>
>> 2) I was also concerned about that, but I figured it would come up in
>> discussion if I just went ahead and proposed it. And here we are!
>>
>> Here's my thought: maybe there are two classes of skips: "controlled" and
>> "uncontrolled", where "controlled" means, as an app author, I deliberately
>> filter out some events, and "uncontrolled" means that I simply don't
>> account for some feature of the data, and the framework skips them (as
>> opposed to crashing).
>>
>> In this breakdown, the skips I'm adding metrics for are all uncontrolled
>> skips (and we hope to measure all the uncontrolled skips). Our skips are
>> well documented, so it wouldn't be terrible to have an application in which
>> you know you expect to have tons of uncontrolled skips, but it's not great
>> either, since you may also have some *unexpected* uncontrolled skips. It'll
>> be difficult to notice, since you're probably not alerting on the metric
>> and filtering out the logs (whatever their level).
>>
>> I'd recommend any app author, as an alternative, to convert all expected
>> skips to controlled ones, by updating the topology to filter those records
>> out.
>>
>> Following from my recommendation, as a library author, I'm inclined to mark
>> those logs WARN, since in my opinion, they should be concerning to the app
>> authors. I'd definitely want to show, rather than hide, them by default, so
>> I would pick INFO at least.
>>
>> That said, logging is always a tricky issue for lower-level libraries that
>> run inside user code, since we don't have all the information we need to
>> make the right call.
>>
>>
>>
>> On your last note, yeah, I got that impression from Guozhang as well.
>> Thanks for the clarification.
>>
>> -John
>>
>>
>>
>> On Mon, Apr 2, 2018 at 4:03 PM, Matthias J. Sax <ma...@confluent.io>
>> wrote:
>>
>>> John,
>>>
>>> sorry for my late reply and thanks for updating the KIP.
>>>
>>> I like your approach about "metrics are for monitoring, logs are for
>>> debugging" -- however:
>>>
>>> 1) I don't see a connection between this and the task-level metrics that
>>> you propose to get the metrics in `TopologyTestDriver`. I don't think
>>> people would monitor the `TopologyTestDriver`, and thus I am wondering why
>>> it is important to include the metrics there. Thread-level metrics might be
>>> easier to monitor, though (i.e., fewer different metrics to monitor).
>>>
>>> 2) I am a little worried about WARN level logging and that it might be
>>> too chatty -- as you pointed out, it's about debugging, thus DEBUG level
>>> might be better. Not 100% sure about this to be honest. What is the
>>> general assumption about the frequency for skipped records? I could
>>> imagine cases for which skipped records are quite frequent and thus,
>>> WARN level logs might "flood" the logs
>>>
>>> One final remark:
>>>
>>>> More
>>>> generally, I would like to establish a pattern in which we could add
>> new
>>>> values for the "reason" tags without needing a KIP to do so.
>>>
>>> From my understanding, this is not feasible. Changing metrics is always
>>> considered a public API change, and we need a KIP for any change. As we
>>> moved away from tagging, it doesn't matter for the KIP anymore -- just
>>> wanted to point it out.
>>>
>>>
>>> -Matthias
>>>
>>>
>>> On 3/30/18 2:47 PM, John Roesler wrote:
>>>> Allrighty! The KIP is updated.
>>>>
>>>> Thanks again, all, for the feedback.
>>>> -John
>>>>
>>>> On Fri, Mar 30, 2018 at 3:35 PM, John Roesler <jo...@confluent.io>
>> wrote:
>>>>
>>>>> Hey Guozhang and Bill,
>>>>>
>>>>> Ok, I'll update the KIP. At the risk of disturbing consensus, I'd like
>>> to
>>>>> put it in the task instead of the thread so that it'll show up in the
>>>>> TopologyTestDriver metrics as well.
>>>>>
>>>>> I'm leaning toward keeping the scope where it is right now, but if
>>> others
>>>>> want to advocate for tossing in some more metrics, we can go that
>> route.
>>>>>
>>>>> Thanks all,
>>>>> -John
>>>>>
>>>>> On Fri, Mar 30, 2018 at 2:37 PM, Bill Bejeck <bb...@gmail.com>
>> wrote:
>>>>>
>>>>>> Thanks for the KIP John, and sorry for the late comments.
>>>>>>
>>>>>> I'm on the fence with providing a single level of metrics, but I think
>>> we'll
>>>>>> have that discussion outside of this KIP.
>>>>>>
>>>>>>> * maintain one skipped-record metric (could be per-thread, per-task,
>>> or
>>>>>>> per-processor-node) with no "reason"
>>>>>>> * introduce a warn-level log detailing the topic/partition/offset
>> and
>>>>>>> reason of the skipped record
>>>>>>
>>>>>> I'm +1 on both of these suggestions.
>>>>>>
>>>>>> Finally, we have had requests in the past for some metrics around
>> when
>>>>>> persistent store removes an expired window.  Would adding that to our
>>>>>> metrics stretch the scope of this KIP too much?
>>>>>>
>>>>>> Thanks again and overall I'm +1 on this KIP
>>>>>>
>>>>>> Bill
>>>>>>
>>>>>> On Fri, Mar 30, 2018 at 2:00 PM, Guozhang Wang <wa...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> The proposal sounds good to me. About "maintain only one level of
>>>>>> metrics"
>>>>>>> maybe we can discuss that separately from this KIP since that
>>>>>> would
>>>>>>> be a larger scope of discussion. I agree that if we are going to
>>>>>> maintain
>>>>>>> only one-level metrics it should be lowest level and we would let
>>> users
>>>>>> to
>>>>>>> do the roll-ups themselves, but I'm still not fully convinced that
>> we
>>>>>>> should just provide single-level metrics, because 1) I think for
>>>>>> different
>>>>>>> metrics people may be interested in investigating different
>>>>>>> granularities, e.g. for poll / commit rate these are at the lowest
>>>>>>> task-level metrics, while for process-rate / skip-rate they can be
>> as
>>>>>> low
>>>>>>> as processor-node metrics, and 2) user-side rolling ups may not be
>>> very
>>>>>>> straight-forward. But for 2) if someone can provide an efficient and
>>>>>> easy
>>>>>>> implementation of that I can be persuaded :)
>>>>>>>
>>>>>>> For now I'm thinking we can add the metric on thread-level, either
>>> with
>>>>>>> finer grained ones with "reason" tag plus an aggregated one without
>>> the
>>>>>>> tag, or just having a single aggregated metric without the tag looks
>>>>>> good
>>>>>>> to me.
>>>>>>>
>>>>>>>
>>>>>>> Guozhang
>>>>>>>
>>>>>>> On Fri, Mar 30, 2018 at 8:05 AM, John Roesler <jo...@confluent.io>
>>>>>> wrote:
>>>>>>>
>>>>>>>> Hey Guozhang,
>>>>>>>>
>>>>>>>> Thanks for the reply. Regarding JMX, I can dig it. I'll provide a
>>>>>> list in
>>>>>>>> the KIP. I was also thinking we'd better start a documentation page
>>>>>> with
>>>>>>>> the metrics listed.
>>>>>>>>
>>>>>>>> I'd have no problem logging a warning when we skip records. On the
>>>>>> metric
>>>>>>>> front, really I'm just pushing for us to maintain only one level of
>>>>>>>> metrics. If that's more or less granular (i.e., maybe we don't
>> have a
>>>>>>>> metric per reason and log the reason instead), that's fine by me. I
>>>>>> just
>>>>>>>> don't think it provides a lot of extra value per complexity
>>> (interface
>>>>>>> and
>>>>>>>> implementation) to maintain roll-ups at the thread level in
>> addition
>>>>>> to
>>>>>>>> lower-level metrics.
>>>>>>>>
>>>>>>>> How about this instead:
>>>>>>>> * maintain one skipped-record metric (could be per-thread,
>> per-task,
>>>>>> or
>>>>>>>> per-processor-node) with no "reason"
>>>>>>>> * introduce a warn-level log detailing the topic/partition/offset
>> and
>>>>>>>> reason of the skipped record
>>>>>>>>
>>>>>>>> If you like that, I can update the KIP.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> -John
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Mar 29, 2018 at 6:22 PM, Guozhang Wang <wangguoz@gmail.com
>>>
>>>>>>> wrote:
>>>>>>>>
>>>>>>>>>> One thing you mention is the notion of setting alerts on coarser
>>>>>>>> metrics
>>>>>>>>> being easier than finer ones. All the metric alerting systems I
>> have
>>>>>>> used
>>>>>>>>> make it equally easy to alert on metrics by-tag or over tags. So
>> my
>>>>>>>>> experience doesn't say that this is a use case. Were you thinking
>>>>>> of an
>>>>>>>>> alerting system that makes such a pre-aggregation valuable?
>>>>>>>>>
>>>>>>>>> For the commonly used JMX reporter, tags will be encoded directly
>> as
>>>>>>> part
>>>>>>>> of
>>>>>>>>> the object name, and if users want to monitor them they need to know
>>>>>>>>> these values beforehand. That is also why I think we do want to list
>> all
>>>>>> the
>>>>>>>>> possible values of the reason tags in the KIP, since
>>>>>>>>>
>>>>>>>>>> In my email in response to Matthias, I gave an example of the
>>>>>> kind of
>>>>>>>>> scenario that would lead me as an operator to run with DEBUG on
>> all
>>>>>> the
>>>>>>>>> time, since I wouldn't be sure, having seen a skipped record once,
>>>>>> that
>>>>>>>> it
>>>>>>>>> would ever happen again. The solution is to capture all the
>>>>>> available
>>>>>>>>> information about the reason and location of skips all the time.
>>>>>>>>>
>>>>>>>>> That is a good point. I think we can either expose all levels
>>>>>> metrics
>>>>>>> as
>>>>>>>> by
>>>>>>>>> default, or only expose the most lower-level metrics and get rid
>> of
>>>>>>> other
>>>>>>>>> levels to let users do roll-ups themselves (which will be a much
>>>>>> larger
>>>>>>>>> scope for discussion), or we can encourage users to not purely
>>>>>> depend
>>>>>>> on
>>>>>>>>> metrics for such troubleshooting: that is to say, users would only be
>>>>>>>>> alerted based on metrics, and we can log an info / warn log4j entry each
>>>>>> time we
>>>>>>>> are
>>>>>>>>> about to skip a record all over the places, so that upon being
>>>>>> notified
>>>>>>>>> users can look into the logs to find the details on where / when
>> it
>>>>>>>>> happens. WDYT?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Guozhang
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, Mar 29, 2018 at 3:57 PM, John Roesler <jo...@confluent.io>
>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hey Guozhang,
>>>>>>>>>>
>>>>>>>>>> Thanks for the review.
>>>>>>>>>>
>>>>>>>>>> 1.
>>>>>>>>>> Matthias raised the same question about the "reason" tag values.
>> I
>>>>>>> can
>>>>>>>>> list
>>>>>>>>>> all possible values of the "reason" tag, but I'm thinking this
>>>>>> level
>>>>>>> of
>>>>>>>>>> detail may not be KIP-worthy, maybe the code and documentation
>>>>>> review
>>>>>>>>> would
>>>>>>>>>> be sufficient. If you all disagree and would like it included in
>>>>>> the
>>>>>>>>> KIP, I
>>>>>>>>>> can certainly do that.
>>>>>>>>>>
>>>>>>>>>> If we do provide roll-up metrics, I agree with the pattern of
>>>>>> keeping
>>>>>>>> the
>>>>>>>>>> same name but eliminating the tags for the dimensions that were
>>>>>>>>> rolled-up.
>>>>>>>>>>
>>>>>>>>>> 2.
>>>>>>>>>> I'm not too sure that implementation efficiency really becomes a
>>>>>>> factor
>>>>>>>>> in
>>>>>>>>>> choosing whether to (by default) update one coarse metric at the
>>>>>>> thread
>>>>>>>>>> level or one granular metric at the processor-node level, since
>>>>>> it's
>>>>>>>> just
>>>>>>>>>> one metric being updated either way. I do agree that if we were
>> to
>>>>>>>> update
>>>>>>>>>> the granular metrics and multiple roll-ups, then we should
>>>>>> consider
>>>>>>> the
>>>>>>>>>> efficiency.
>>>>>>>>>>
>>>>>>>>>> I agree it's probably not necessary to surface the metrics for
>> all
>>>>>>>> nodes
>>>>>>>>>> regardless of whether they can or do skip records. Perhaps we can
>>>>>>>> lazily
>>>>>>>>>> register the metrics.
>>>>>>>>>>
>>>>>>>>>> In my email in response to Matthias, I gave an example of the
>>>>>> kind of
>>>>>>>>>> scenario that would lead me as an operator to run with DEBUG on
>>>>>> all
>>>>>>> the
>>>>>>>>>> time, since I wouldn't be sure, having seen a skipped record
>> once,
>>>>>>> that
>>>>>>>>> it
>>>>>>>>>> would ever happen again. The solution is to capture all the
>>>>>> available
>>>>>>>>>> information about the reason and location of skips all the time.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> One thing you mention is the notion of setting alerts on coarser
>>>>>>>> metrics
>>>>>>>>>> being easier than finer ones. All the metric alerting systems I
>>>>>> have
>>>>>>>> used
>>>>>>>>>> make it equally easy to alert on metrics by-tag or over tags. So
>>>>>> my
>>>>>>>>>> experience doesn't say that this is a use case. Were you thinking
>>>>>> of
>>>>>>> an
>>>>>>>>>> alerting system that makes such a pre-aggregation valuable?
>>>>>>>>>>
>>>>>>>>>> Thanks again,
>>>>>>>>>> -John
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Thu, Mar 29, 2018 at 5:24 PM, Guozhang Wang <
>>>>>> wangguoz@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hello John,
>>>>>>>>>>>
>>>>>>>>>>> Thanks for the KIP. Some comments:
>>>>>>>>>>>
>>>>>>>>>>> 1. Could you list all the possible values of the "reason" tag?
>>>>>> In
>>>>>>> the
>>>>>>>>>> JIRA
>>>>>>>>>>> ticket I left some potential reasons but I'm not clear if you're
>>>>>>>> going
>>>>>>>>> to
>>>>>>>>>>> categorize each of them as a separate reason, or is there any
>>>>>>>>> additional
>>>>>>>>>>> ones you have in mind.
>>>>>>>>>>>
>>>>>>>>>>> Also I'm wondering if we should add another metric that do not
>>>>>> have
>>>>>>>> the
>>>>>>>>>>> reason tag but aggregates among all possible reasons? This is
>>>>>> for
>>>>>>>> users
>>>>>>>>>> to
>>>>>>>>>>> easily set their alerting notifications (otherwise they have to
>>>>>>> write
>>>>>>>>> on
>>>>>>>>>>> notification rule per reason) in their monitoring systems.
>>>>>>>>>>>
>>>>>>>>>>> 2. Note that the processor-node metrics is actually "per-thread,
>>>>>>>>>> per-task,
>>>>>>>>>>> per-processor-node", and today we only set the per-thread
>>>>>> metrics
>>>>>>> as
>>>>>>>>> INFO
>>>>>>>>>>> while leaving the lower two layers as DEBUG. I agree with your
>>>>>>>> argument
>>>>>>>>>>> that we are missing the per-client roll-up metrics today, but
>>>>>> I'm
>>>>>>>>>> convinced
>>>>>>>>>>> that the right way to approach it would be
>>>>>>>> "just-providing-the-lowest-
>>>>>>>>>>> level
>>>>>>>>>>> metrics only".
>>>>>>>>>>>
>>>>>>>>>>> Note the recoding implementation of these three levels are
>>>>>>> different
>>>>>>>>>>> internally today: we did not just do the rolling up to generate
>>>>>> the
>>>>>>>>>>> higher-level metrics from the lower level ones, but we just
>>>>>> record
>>>>>>>> them
>>>>>>>>>>> separately, which means that, if we turn on multiple levels of
>>>>>>>> metrics,
>>>>>>>>>> we
>>>>>>>>>>> maybe duplicate collecting some metrics. One can argue that is
>>>>>> not
>>>>>>>> the
>>>>>>>>>> best
>>>>>>>>>>> way to represent multi-level metrics collecting and reporting,
>>>>>> but
>>>>>>> by
>>>>>>>>>> only
>>>>>>>>>>> enabling thread-level metrics as INFO today, that implementation
>>>>>>>> could
>>>>>>>>> be
>>>>>>>>>>> more efficient than only collecting the metrics at the lowest
>>>>>>> level,
>>>>>>>>> and
>>>>>>>>>>> then do the roll-up calculations outside of the metrics classes.
>>>>>>>>>>>
>>>>>>>>>>> Plus, today not all processor-nodes may possibly skip records,
>>>>>>> AFAIK
>>>>>>>> we
>>>>>>>>>>> will only skip records at the source, sink, window and
>>>>>> aggregation
>>>>>>>>>>> processor nodes, so adding a metric per processor looks like an
>>>>>>>>> overkill
>>>>>>>>>> to
>>>>>>>>>>> me as well. On the other hand, from user's perspective the
>>>>>> "reason"
>>>>>>>> tag
>>>>>>>>>> may
>>>>>>>>>>> be sufficient for them to narrow down where inside the topology
>>>>>> is
>>>>>>>>>> causing
>>>>>>>>>>> records to be dropped on the floor. So I think the "per-thread,
>>>>>>>>> per-task"
>>>>>>>>>>> level metrics should be sufficient for them in trouble shoot in
>>>>>>> DEBUG
>>>>>>>>>> mode,
>>>>>>>>>>> and we can add another "per-thread" level metrics as INFO which
>>>>>> is
>>>>>>>>> turned
>>>>>>>>>>> on by default. So under normal execution users still only need
>>>>>> INFO
>>>>>>>>> level
>>>>>>>>>>> metrics for alerting (e.g. set alerts on all skipped-records
>>>>>>> metrics
>>>>>>>> as
>>>>>>>>>>> non-zero), and then upon trouble shooting they can turn on DEBUG
>>>>>>>>> metrics
>>>>>>>>>> to
>>>>>>>>>>> look into which task is actually causing the skipped records.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Guozhang
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Mar 29, 2018 at 2:03 PM, Matthias J. Sax <
>>>>>>>>> matthias@confluent.io>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Thanks for the KIP John.
>>>>>>>>>>>>
>>>>>>>>>>>> Reading the material on the related Jiras, I am wondering what
>>>>>>>>> `reason`
>>>>>>>>>>>> tags you want to introduce? Can you elaborate? The KIP should
>>>>>>> list
>>>>>>>>>> those
>>>>>>>>>>>> IMHO.
>>>>>>>>>>>>
>>>>>>>>>>>> About the fine grained metrics vs the roll-up: you say that
>>>>>>>>>>>>
>>>>>>>>>>>>> the coarse metric aggregates across two dimensions
>>>>>>> simultaneously
>>>>>>>>>>>>
>>>>>>>>>>>> Can you elaborate why this is an issue? I am not convinced atm
>>>>>>> that
>>>>>>>>> we
>>>>>>>>>>>> should put the fine grained metrics into INFO level and remove
>>>>>>> the
>>>>>>>>>>>> roll-up at thread level.
>>>>>>>>>>>>
>>>>>>>>>>>>> Given that they have to do this sum to get a usable
>>>>>> top-level
>>>>>>>> view
>>>>>>>>>>>>
>>>>>>>>>>>> This is a fair concern, but I don't share the conclusion.
>>>>>>> Offering
>>>>>>>> a
>>>>>>>>>>>> built-in `KafkaStreams` "client" roll-up out of the box might
>>>>>> be
>>>>>>> a
>>>>>>>>>>>> better solution. In the past we did not offer this due to
>>>>>>>> performance
>>>>>>>>>>>> concerns, but we could allow an "opt-in" mechanism. If you
>>>>>>>> disagree,
>>>>>>>>>> can
>>>>>>>>>>>> you provide some reasoning and add them to the "Rejected
>>>>>>>>> alternatives"
>>>>>>>>>>>> section.
>>>>>>>>>>>>
>>>>>>>>>>>> To rephrase: I understand the issue about missing top-level
>>>>>> view,
>>>>>>>> but
>>>>>>>>>>>> instead of going more fine grained, we should consider to add
>>>>>>> this
>>>>>>>>>>>> top-level view and add/keep the fine grained metrics at DEBUG
>>>>>>> level
>>>>>>>>>> only
>>>>>>>>>>>>
>>>>>>>>>>>> I am +1 to add TopologyTestDriver#metrics() and to remove old
>>>>>>>> metrics
>>>>>>>>>>>> directly as you suggested.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> -Matthias
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On 3/28/18 6:42 PM, Ted Yu wrote:
>>>>>>>>>>>>> Looks good to me.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Mar 28, 2018 at 3:11 PM, John Roesler <
>>>>>>> john@confluent.io
>>>>>>>>>
>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hello all,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I am proposing KIP-274 to improve the metrics around
>>>>>> skipped
>>>>>>>>> records
>>>>>>>>>>> in
>>>>>>>>>>>>>> Streams.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Please find the details here:
>>>>>>>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-
>>>>>>>>>>>>>> 274%3A+Kafka+Streams+Skipped+Records+Metrics
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Please let me know what you think!
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> -John
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> -- Guozhang
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> -- Guozhang
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> -- Guozhang
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>
> 
> 
> 


Re: [DISCUSS] KIP-274: Kafka Streams Skipped Records Metrics

Posted by Guozhang Wang <wa...@gmail.com>.
Regarding logging: I'm inclined to keep logging at WARN level, since skipped
records are not expected in normal execution (for all the reasons that we are
aware of). When errors happen, users should be alerted by the metrics and then
look into the log files, so to me, if the skips are really spamming the log
files, that in itself is a good alert for users. Besides, for deserialization
errors we already log at WARN level for this reason.
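
A minimal sketch of what such a WARN entry could carry (the method name and
message format below are illustrative, not the actual Streams implementation):
each skip should record the reason plus enough coordinates (topic, partition,
offset) for an operator who was alerted by the metric to find the exact record.

```java
public class SkippedRecordLogger {
    // Builds the WARN-level message; in Streams this would go through the
    // component's slf4j logger. The exact format here is a hypothetical example.
    static String skipMessage(String reason, String topic, int partition, long offset) {
        return String.format(
                "Skipping record due to %s. topic=[%s] partition=[%d] offset=[%d]",
                reason, topic, partition, offset);
    }

    public static void main(String[] args) {
        // prints: Skipping record due to deserialization error. topic=[input-topic] partition=[2] offset=[1042]
        System.out.println(skipMessage("deserialization error", "input-topic", 2, 1042L));
    }
}
```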

Regarding the metrics levels: I was pondering that as well. What made me
think about and agree on task-level rather than thread-level is that some skip
reasons, like window retention, may affect only a subset of input partitions;
since tasks are correlated with partitions, task-level metrics can help users
narrow down the specific input data partitions.
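
The task-to-partition correlation works because a Streams task id encodes the
input partition: it has the form `<subtopologyId>_<partition>`, e.g. task `0_2`
processes partition 2. A small sketch (hypothetical helper, not Streams API) of
reading the partition back out of a task-level metric tag:

```java
public class TaskIdPartition {
    // A Streams task id is "<subtopologyId>_<partition>", so a skip metric
    // tagged with the task id points directly at the affected input partition.
    static int partitionOf(String taskId) {
        return Integer.parseInt(taskId.substring(taskId.indexOf('_') + 1));
    }

    public static void main(String[] args) {
        System.out.println(partitionOf("0_2")); // prints 2
    }
}
```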


Guozhang


On Mon, Apr 2, 2018 at 6:43 PM, John Roesler <jo...@confluent.io> wrote:

> Hi Matthias,
>
> No worries! Thanks for the reply.
>
> 1) There isn't a connection. I tried using the TopologyTestDriver to write
> a quick test exercising the current behavior and discovered that the
> metrics weren't available. It seemed like they should be, so I tacked it on
> to this KIP. If you feel it's inappropriate, I can pull it back out.
>
> 2) I was also concerned about that, but I figured it would come up in
> discussion if I just went ahead and proposed it. And here we are!
>
> Here's my thought: maybe there are two classes of skips: "controlled" and
> "uncontrolled", where "controlled" means, as an app author, I deliberately
> filter out some events, and "uncontrolled" means that I simply don't
> account for some feature of the data, and the framework skips them (as
> opposed to crashing).
>
> In this breakdown, the skips I'm adding metrics for are all uncontrolled
> skips (and we hope to measure all the uncontrolled skips). Our skips are
> well documented, so it wouldn't be terrible to have an application in which
> you know you expect to have tons of uncontrolled skips, but it's not great
> either, since you may also have some *unexpected* uncontrolled skips. It'll
> be difficult to notice, since you're probably not alerting on the metric
> and filtering out the logs (whatever their level).
>
> I'd recommend any app author, as an alternative, to convert all expected
> skips to controlled ones, by updating the topology to filter those records
> out.
>
> Following from my recommendation, as a library author, I'm inclined to mark
> those logs WARN, since in my opinion, they should be concerning to the app
> authors. I'd definitely want to show, rather than hide, them by default, so
> I would pick INFO at least.
>
> That said, logging is always a tricky issue for lower-level libraries that
> run inside user code, since we don't have all the information we need to
> make the right call.
>
>
>
> On your last note, yeah, I got that impression from Guozhang as well.
> Thanks for the clarification.
>
> -John
>
>
>
> On Mon, Apr 2, 2018 at 4:03 PM, Matthias J. Sax <ma...@confluent.io>
> wrote:
>
> > John,
> >
> > sorry for my late reply and thanks for updating the KIP.
> >
> > I like your approach about "metrics are for monitoring, logs are for
> > debugging" -- however:
> >
> > 1) I don't see a connection between this and the task-level metrics that
> > you propose to get the metrics in `TopologyTestDriver`. I don't think
> > people would monitor the `TopologyTestDriver` an thus wondering why it
> > is important to include the metrics there? Thread-level metric might be
> > easier to monitor though (ie, less different metric to monitor).
> >
> > 2) I am a little worried about WARN level logging and that it might be
> > too chatty -- as you pointed out, it's about debugging, thus DEBUG level
> > might be better. Not 100% sure about this to be honest. What is the
> > general assumption about the frequency for skipped records? I could
> > imagine cases for which skipped records are quite frequent and thus,
> > WARN level logs might "flood" the logs
> >
> > One final remark:
> >
> > > More
> > > generally, I would like to establish a pattern in which we could add
> new
> > > values for the "reason" tags without needing a KIP to do so.
> >
> > From my understanding, this is not feasible. Changing metrics is always
> > considered a public API change, and we need a KIP for any change. As we
> > moved away from tagging, it doesn't matter for the KIP anymore -- just
> > wanted to point it out.
> >
> >
> > -Matthias
> >
> >
> > On 3/30/18 2:47 PM, John Roesler wrote:
> > > Allrighty! The KIP is updated.
> > >
> > > Thanks again, all, for the feedback.
> > > -John
> > >
> > > On Fri, Mar 30, 2018 at 3:35 PM, John Roesler <jo...@confluent.io>
> wrote:
> > >
> > >> Hey Guozhang and Bill,
> > >>
> > >> Ok, I'll update the KIP. At the risk of disturbing consensus, I'd like
> > to
> > >> put it in the task instead of the thread so that it'll show up in the
> > >> TopologyTestDriver metrics as well.
> > >>
> > >> I'm leaning toward keeping the scope where it is right now, but if
> > others
> > >> want to advocate for tossing in some more metrics, we can go that
> route.
> > >>
> > >> Thanks all,
> > >> -John
> > >>
> > >> On Fri, Mar 30, 2018 at 2:37 PM, Bill Bejeck <bb...@gmail.com>
> wrote:
> > >>
> > >>> Thanks for the KIP John, and sorry for the late comments.
> > >>>
> > >>> I'm on the fence with providing a single level metrics, but I think
> > we'll
> > >>> have that discussion outside of this KIP.
> > >>>
> > >>>> * maintain one skipped-record metric (could be per-thread, per-task,
> > or
> > >>>> per-processor-node) with no "reason"
> > >>>> * introduce a warn-level log detailing the topic/partition/offset
> and
> > >>>> reason of the skipped record
> > >>>
> > >>> I'm +1 on both of these suggestions.
> > >>>
> > >>> Finally, we have had requests in the past for some metrics around
> when
> > >>> persistent store removes an expired window.  Would adding that to our
> > >>> metrics stretch the scope of this KIP too much?
> > >>>
> > >>> Thanks again and overall I'm +1 on this KIP
> > >>>
> > >>> Bill
> > >>>
> > >>> On Fri, Mar 30, 2018 at 2:00 PM, Guozhang Wang <wa...@gmail.com>
> > >>> wrote:
> > >>>
> > >>>> The proposal sounds good to me. About "maintain only one level of
> > >>> metrics"
> > >>>> maybe we can discuss about that separately from this KIP since that
> > >>> would
> > >>>> be a larger scope of discussion. I agree that if we are going to
> > >>> maintain
> > >>>> only one-level metrics it should be lowest level and we would let
> > users
> > >>> to
> > >>>> do the roll-ups themselves, but I'm still not fully convinced that
> we
> > >>>> should just provide single-level metrics, because 1) I think for
> > >>> different
> > >>>> metrics people may be interested to investigate into different
> > >>>> granularities, e.g. for poll / commit rate these are at the lowest
> > >>>> task-level metrics, while for process-rate / skip-rate they can be
> as
> > >>> low
> > >>>> as processor-node metrics, and 2) user-side rolling ups may not be
> > very
> > >>>> straight-forward. But for 2) if someone can provide an efficient and
> > >>> easy
> > >>>> implementation of that I can be persuaded :)
> > >>>>
> > >>>> For now I'm thinking we can add the metric on thread-level, either
> > with
> > >>>> finer grained ones with "reason" tag plus an aggregated one without
> > the
> > >>>> tag, or just having a single aggregated metric without the tag looks
> > >>> good
> > >>>> to me.
> > >>>>
> > >>>>
> > >>>> Guozhang
> > >>>>
> > >>>> On Fri, Mar 30, 2018 at 8:05 AM, John Roesler <jo...@confluent.io>
> > >>> wrote:
> > >>>>
> > >>>>> Hey Guozhang,
> > >>>>>
> > >>>>> Thanks for the reply. Regarding JMX, I can dig it. I'll provide a
> > >>> list in
> > >>>>> the KIP. I was also thinking we'd better start a documentation page
> > >>> with
> > >>>>> the metrics listed.
> > >>>>>
> > >>>>> I'd have no problem logging a warning when we skip records. On the
> > >>> metric
> > >>>>> front, really I'm just pushing for us to maintain only one level of
> > >>>>> metrics. If that's more or less granular (i.e., maybe we don't
> have a
> > >>>>> metric per reason and log the reason instead), that's fine by me. I
> > >>> just
> > >>>>> don't think it provides a lot of extra value per complexity
> > (interface
> > >>>> and
> > >>>>> implementation) to maintain roll-ups at the thread level in
> addition
> > >>> to
> > >>>>> lower-level metrics.
> > >>>>>
> > >>>>> How about this instead:
> > >>>>> * maintain one skipped-record metric (could be per-thread,
> per-task,
> > >>> or
> > >>>>> per-processor-node) with no "reason"
> > >>>>> * introduce a warn-level log detailing the topic/partition/offset
> and
> > >>>>> reason of the skipped record
> > >>>>>
> > >>>>> If you like that, I can update the KIP.
> > >>>>>
> > >>>>> Thanks,
> > >>>>> -John
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>> On Thu, Mar 29, 2018 at 6:22 PM, Guozhang Wang <wangguoz@gmail.com
> >
> > >>>> wrote:
> > >>>>>
> > >>>>>>> One thing you mention is the notion of setting alerts on coarser
> > >>>>> metrics
> > >>>>>> being easier than finer ones. All the metric alerting systems I
> have
> > >>>> used
> > >>>>>> make it equally easy to alert on metrics by-tag or over tags. So
> my
> > >>>>>> experience doesn't say that this is a use case. Were you thinking
> > >>> of an
> > >>>>>> alerting system that makes such a pre-aggregation valuable?
> > >>>>>>
> > >>>>>> For the commonly used JMX reporter tags will be encoded directly
> as
> > >>>> part
> > >>>>> of
> > >>>>>> the object name, and if users wants to monitor them they need to
> > >>> know
> > >>>>> these
> > >>>>>> values before hand. That is also why I think we do want to list
> all
> > >>> the
> > >>>>>> possible values of the reason tags in the KIP, since
> > >>>>>>
> > >>>>>>> In my email in response to Matthias, I gave an example of the
> > >>> kind of
> > >>>>>> scenario that would lead me as an operator to run with DEBUG on
> all
> > >>> the
> > >>>>>> time, since I wouldn't be sure, having seen a skipped record once,
> > >>> that
> > >>>>> it
> > >>>>>> would ever happen again. The solution is to capture all the
> > >>> available
> > >>>>>> information about the reason and location of skips all the time.
> > >>>>>>
> > >>>>>> That is a good point. I think we can either expose all levels
> > >>> metrics
> > >>>> as
> > >>>>> by
> > >>>>>> default, or only expose the most lower-level metrics and get rid
> of
> > >>>> other
> > >>>>>> levels to let users do roll-ups themselves (which will be a much
> > >>> larger
> > >>>>>> scope for discussion), or we can encourage users to not purely
> > >>> depend
> > >>>> on
> > >>>>>> metrics for such trouble shooting: that is to say, users only be
> > >>>> alerted
> > >>>>>> based on metrics, and we can log a info / warn log4j entry each
> > >>> time we
> > >>>>> are
> > >>>>>> about to skip a record all over the places, so that upon being
> > >>> notified
> > >>>>>> users can look into the logs to find the details on where / when
> it
> > >>>>>> happens. WDYT?
> > >>>>>>
> > >>>>>>
> > >>>>>> Guozhang
> > >>>>>>
> > >>>>>>
> > >>>>>>
> > >>>>>> On Thu, Mar 29, 2018 at 3:57 PM, John Roesler <jo...@confluent.io>
> > >>>> wrote:
> > >>>>>>
> > >>>>>>> Hey Guozhang,
> > >>>>>>>
> > >>>>>>> Thanks for the review.
> > >>>>>>>
> > >>>>>>> 1.
> > >>>>>>> Matthias raised the same question about the "reason" tag values.
> I
> > >>>> can
> > >>>>>> list
> > >>>>>>> all possible values of the "reason" tag, but I'm thinking this
> > >>> level
> > >>>> of
> > >>>>>>> detail may not be KIP-worthy, maybe the code and documentation
> > >>> review
> > >>>>>> would
> > >>>>>>> be sufficient. If you all disagree and would like it included in
> > >>> the
> > >>>>>> KIP, I
> > >>>>>>> can certainly do that.
> > >>>>>>>
> > >>>>>>> If we do provide roll-up metrics, I agree with the pattern of
> > >>> keeping
> > >>>>> the
> > >>>>>>> same name but eliminating the tags for the dimensions that were
> > >>>>>> rolled-up.
> > >>>>>>>
> > >>>>>>> 2.
> > >>>>>>> I'm not too sure that implementation efficiency really becomes a
> > >>>> factor
> > >>>>>> in
> > >>>>>>> choosing whether to (by default) update one coarse metric at the
> > >>>> thread
> > >>>>>>> level or one granular metric at the processor-node level, since
> > >>> it's
> > >>>>> just
> > >>>>>>> one metric being updated either way. I do agree that if we were
> to
> > >>>>> update
> > >>>>>>> the granular metrics and multiple roll-ups, then we should
> > >>> consider
> > >>>> the
> > >>>>>>> efficiency.
> > >>>>>>>
> > >>>>>>> I agree it's probably not necessary to surface the metrics for
> all
> > >>>>> nodes
> > >>>>>>> regardless of whether they can or do skip records. Perhaps we can
> > >>>>> lazily
> > >>>>>>> register the metrics.
> > >>>>>>>
> > >>>>>>> In my email in response to Matthias, I gave an example of the
> > >>> kind of
> > >>>>>>> scenario that would lead me as an operator to run with DEBUG on
> > >>> all
> > >>>> the
> > >>>>>>> time, since I wouldn't be sure, having seen a skipped record
> once,
> > >>>> that
> > >>>>>> it
> > >>>>>>> would ever happen again. The solution is to capture all the
> > >>> available
> > >>>>>>> information about the reason and location of skips all the time.
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> One thing you mention is the notion of setting alerts on coarser
> > >>>>> metrics
> > >>>>>>> being easier than finer ones. All the metric alerting systems I
> > >>> have
> > >>>>> used
> > >>>>>>> make it equally easy to alert on metrics by-tag or over tags. So
> > >>> my
> > >>>>>>> experience doesn't say that this is a use case. Were you thinking
> > >>> of
> > >>>> an
> > >>>>>>> alerting system that makes such a pre-aggregation valuable?
> > >>>>>>>
> > >>>>>>> Thanks again,
> > >>>>>>> -John
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> On Thu, Mar 29, 2018 at 5:24 PM, Guozhang Wang <
> > >>> wangguoz@gmail.com>
> > >>>>>> wrote:
> > >>>>>>>
> > >>>>>>>> Hello John,
> > >>>>>>>>
> > >>>>>>>> Thanks for the KIP. Some comments:
> > >>>>>>>>
> > >>>>>>>> 1. Could you list all the possible values of the "reason" tag?
> > >>> In
> > >>>> the
> > >>>>>>> JIRA
> > >>>>>>>> ticket I left some potential reasons but I'm not clear if you're
> > >>>>> going
> > >>>>>> to
> > >>>>>>>> categorize each of them as a separate reason, or is there any
> > >>>>>> additional
> > >>>>>>>> ones you have in mind.
> > >>>>>>>>
> > >>>>>>>> Also I'm wondering if we should add another metric that do not
> > >>> have
> > >>>>> the
> > >>>>>>>> reason tag but aggregates among all possible reasons? This is
> > >>> for
> > >>>>> users
> > >>>>>>> to
> > >>>>>>>> easily set their alerting notifications (otherwise they have to
> > >>>> write
> > >>>>>> on
> > >>>>>>>> notification rule per reason) in their monitoring systems.
> > >>>>>>>>
> > >>>>>>>> 2. Note that the processor-node metrics is actually "per-thread,
> > >>>>>>> per-task,
> > >>>>>>>> per-processor-node", and today we only set the per-thread
> > >>> metrics
> > >>>> as
> > >>>>>> INFO
> > >>>>>>>> while leaving the lower two layers as DEBUG. I agree with your
> > >>>>> argument
> > >>>>>>>> that we are missing the per-client roll-up metrics today, but
> > >>> I'm
> > >>>>>>> convinced
> > >>>>>>>> that the right way to approach it would be
> > >>>>> "just-providing-the-lowest-
> > >>>>>>>> level
> > >>>>>>>> metrics only".
> > >>>>>>>>
> > >>>>>>>> Note the recoding implementation of these three levels are
> > >>>> different
> > >>>>>>>> internally today: we did not just do the rolling up to generate
> > >>> the
> > >>>>>>>> higher-level metrics from the lower level ones, but we just
> > >>> record
> > >>>>> them
> > >>>>>>>> separately, which means that, if we turn on multiple levels of
> > >>>>> metrics,
> > >>>>>>> we
> > >>>>>>>> maybe duplicate collecting some metrics. One can argue that is
> > >>> not
> > >>>>> the
> > >>>>>>> best
> > >>>>>>>> way to represent multi-level metrics collecting and reporting,
> > >>> but
> > >>>> by
> > >>>>>>> only
> > >>>>>>>> enabling thread-level metrics as INFO today, that implementation
> > >>>>> could
> > >>>>>> be
> > >>>>>>>> more efficient than only collecting the metrics at the lowest
> > >>>> level,
> > >>>>>> and
> > >>>>>>>> then do the roll-up calculations outside of the metrics classes.
> > >>>>>>>>
> > >>>>>>>> Plus, today not all processor-nodes may possibly skip records,
> > >>>> AFAIK
> > >>>>> we
> > >>>>>>>> will only skip records at the source, sink, window and
> > >>> aggregation
> > >>>>>>>> processor nodes, so adding a metric per processor looks like an
> > >>>>>> overkill
> > >>>>>>> to
> > >>>>>>>> me as well. On the other hand, from user's perspective the
> > >>> "reason"
> > >>>>> tag
> > >>>>>>> may
> > >>>>>>>> be sufficient for them to narrow down where inside the topology
> > >>> is
> > >>>>>>> causing
> > >>>>>>>> records to be dropped on the floor. So I think the "per-thread,
> > >>>>>> per-task"
> > >>>>>>>> level metrics should be sufficient for them in trouble shoot in
> > >>>> DEBUG
> > >>>>>>> mode,
> > >>>>>>>> and we can add another "per-thread" level metrics as INFO which
> > >>> is
> > >>>>>> turned
> > >>>>>>>> on by default. So under normal execution users still only need
> > >>> INFO
> > >>>>>> level
> > >>>>>>>> metrics for alerting (e.g. set alerts on all skipped-records
> > >>>> metrics
> > >>>>> as
> > >>>>>>>> non-zero), and then upon trouble shooting they can turn on DEBUG
> > >>>>>> metrics
> > >>>>>>> to
> > >>>>>>>> look into which task is actually causing the skipped records.
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> Guozhang
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> On Thu, Mar 29, 2018 at 2:03 PM, Matthias J. Sax <
> > >>>>>> matthias@confluent.io>
> > >>>>>>>> wrote:
> > >>>>>>>>
> > >>>>>>>>> Thanks for the KIP John.
> > >>>>>>>>>
> > >>>>>>>>> Reading the material on the related Jiras, I am wondering what
> > >>>>>> `reason`
> > >>>>>>>>> tags you want to introduce? Can you elaborate? The KIP should
> > >>>> list
> > >>>>>>> those
> > >>>>>>>>> IMHO.
> > >>>>>>>>>
> > >>>>>>>>> About the fine grained metrics vs the roll-up: you say that
> > >>>>>>>>>
> > >>>>>>>>>> the coarse metric aggregates across two dimensions
> > >>>> simultaneously
> > >>>>>>>>>
> > >>>>>>>>> Can you elaborate why this is an issue? I am not convinced atm
> > >>>> that
> > >>>>>> we
> > >>>>>>>>> should put the fine grained metrics into INFO level and remove
> > >>>> the
> > >>>>>>>>> roll-up at thread level.
> > >>>>>>>>>
> > >>>>>>>>>> Given that they have to do this sum to get a usable
> > >>> top-level
> > >>>>> view
> > >>>>>>>>>
> > >>>>>>>>> This is a fair concern, but I don't share the conclusion.
> > >>>> Offering
> > >>>>> a
> > >>>>>>>>> built-in `KafkaStreams` "client" roll-up out of the box might
> > >>> be
> > >>>> a
> > >>>>>>>>> better solution. In the past we did not offer this due to
> > >>>>> performance
> > >>>>>>>>> concerns, but we could allow an "opt-in" mechanism. If you
> > >>>>> disagree,
> > >>>>>>> can
> > >>>>>>>>> you provide some reasoning and add them to the "Rejected
> > >>>>>> alternatives"
> > >>>>>>>>> section.
> > >>>>>>>>>
> > >>>>>>>>> To rephrase: I understand the issue about missing top-level
> > >>> view,
> > >>>>> but
> > >>>>>>>>> instead of going more fine grained, we should consider to add
> > >>>> this
> > >>>>>>>>> top-level view and add/keep the fine grained metrics at DEBUG
> > >>>> level
> > >>>>>>> only
> > >>>>>>>>>
> > >>>>>>>>> I am +1 to add TopologyTestDriver#metrics() and to remove old
> > >>>>> metrics
> > >>>>>>>>> directly as you suggested.
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> -Matthias
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> On 3/28/18 6:42 PM, Ted Yu wrote:
> > >>>>>>>>>> Looks good to me.
> > >>>>>>>>>>
> > >>>>>>>>>> On Wed, Mar 28, 2018 at 3:11 PM, John Roesler <
> > >>>> john@confluent.io
> > >>>>>>
> > >>>>>>>> wrote:
> > >>>>>>>>>>
> > >>>>>>>>>>> Hello all,
> > >>>>>>>>>>>
> > >>>>>>>>>>> I am proposing KIP-274 to improve the metrics around
> > >>> skipped
> > >>>>>> records
> > >>>>>>>> in
> > >>>>>>>>>>> Streams.
> > >>>>>>>>>>>
> > >>>>>>>>>>> Please find the details here:
> > >>>>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> > >>>>>>>>>>> 274%3A+Kafka+Streams+Skipped+Records+Metrics
> > >>>>>>>>>>>
> > >>>>>>>>>>> Please let me know what you think!
> > >>>>>>>>>>>
> > >>>>>>>>>>> Thanks,
> > >>>>>>>>>>> -John
> > >>>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> --
> > >>>>>>>> -- Guozhang
> > >>>>>>>>
> > >>>>>>>
> > >>>>>>
> > >>>>>>
> > >>>>>>
> > >>>>>> --
> > >>>>>> -- Guozhang
> > >>>>>>
> > >>>>>
> > >>>>
> > >>>>
> > >>>>
> > >>>> --
> > >>>> -- Guozhang
> > >>>>
> > >>>
> > >>
> > >>
> > >
> >
> >
>



-- 
-- Guozhang

Re: [DISCUSS] KIP-274: Kafka Streams Skipped Records Metrics

Posted by John Roesler <jo...@confluent.io>.
Hi Matthias,

No worries! Thanks for the reply.

1) There isn't a connection. I tried using the TopologyTestDriver to write
a quick test exercising the current behavior and discovered that the
metrics weren't available. It seemed like they should be, so I tacked it on
to this KIP. If you feel it's inappropriate, I can pull it back out.

2) I was also concerned about that, but I figured it would come up in
discussion if I just went ahead and proposed it. And here we are!

Here's my thought: maybe there are two classes of skips: "controlled" and
"uncontrolled", where "controlled" means, as an app author, I deliberately
filter out some events, and "uncontrolled" means that I simply don't
account for some feature of the data, and the framework skips them (as
opposed to crashing).

In this breakdown, the skips I'm adding metrics for are all uncontrolled
skips (and we hope to measure all the uncontrolled skips). Our skips are
well documented, so it wouldn't be terrible to have an application in which
you know to expect tons of uncontrolled skips, but it's not great
either, since you may also have some *unexpected* uncontrolled skips. They'll
be difficult to notice, since you're probably not alerting on the metric,
and you're filtering out the logs (whatever their level).

I'd recommend any app author, as an alternative, to convert all expected
skips to controlled ones, by updating the topology to filter those records
out.
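
As an illustration of converting an "uncontrolled" skip into a "controlled"
one (a sketch only; in the Streams DSL this would be a `stream.filter((k, v) -> ...)`
ahead of the processing step): records the app knows it cannot handle are
dropped deliberately, so any remaining framework-level skip is genuinely
unexpected and worth alerting on.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class ControlledSkips {
    // Deliberately drop records the application knows it cannot process
    // (here: null or empty values), mirroring a topology-level filter.
    static List<String> keepProcessable(List<String> values) {
        return values.stream()
                .filter(v -> v != null && !v.isEmpty())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(keepProcessable(Arrays.asList("a", null, "", "b"))); // prints [a, b]
    }
}
```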

Following from my recommendation, as a library author I'm inclined to mark
those logs WARN, since in my opinion they should be concerning to app
authors. I'd definitely want to show them, rather than hide them, by default,
so I would pick INFO at least.

That said, logging is always a tricky issue for lower-level libraries that
run inside user code, since we don't have all the information we need to
make the right call.



On your last note, yeah, I got that impression from Guozhang as well.
Thanks for the clarification.

-John



On Mon, Apr 2, 2018 at 4:03 PM, Matthias J. Sax <ma...@confluent.io>
wrote:

> John,
>
> sorry for my late reply and thanks for updating the KIP.
>
> I like your approach about "metrics are for monitoring, logs are for
> debugging" -- however:
>
> 1) I don't see a connection between this and the task-level metrics that
> you propose to get the metrics in `TopologyTestDriver`. I don't think
> people would monitor the `TopologyTestDriver` an thus wondering why it
> is important to include the metrics there? Thread-level metric might be
> easier to monitor though (ie, less different metric to monitor).
>
> 2) I am a little worried about WARN level logging and that it might be
> too chatty -- as you pointed out, it's about debugging, thus DEBUG level
> might be better. Not 100% sure about this to be honest. What is the
> general assumption about the frequency for skipped records? I could
> imagine cases for which skipped records are quite frequent and thus,
> WARN level logs might "flood" the logs
>
> One final remark:
>
> > More
> > generally, I would like to establish a pattern in which we could add new
> > values for the "reason" tags without needing a KIP to do so.
>
> From my understanding, this is not feasible. Changing metrics is always
> considered a public API change, and we need a KIP for any change. As we
> moved away from tagging, it doesn't matter for the KIP anymore -- just
> wanted to point it out.
>
>
> -Matthias
>
>
> On 3/30/18 2:47 PM, John Roesler wrote:
> > Allrighty! The KIP is updated.
> >
> > Thanks again, all, for the feedback.
> > -John
> >
> > On Fri, Mar 30, 2018 at 3:35 PM, John Roesler <jo...@confluent.io> wrote:
> >
> >> Hey Guozhang and Bill,
> >>
> >> Ok, I'll update the KIP. At the risk of disturbing consensus, I'd like
> to
> >> put it in the task instead of the thread so that it'll show up in the
> >> TopologyTestDriver metrics as well.
> >>
> >> I'm leaning toward keeping the scope where it is right now, but if
> others
> >> want to advocate for tossing in some more metrics, we can go that route.
> >>
> >> Thanks all,
> >> -John
> >>
> >> On Fri, Mar 30, 2018 at 2:37 PM, Bill Bejeck <bb...@gmail.com> wrote:
> >>
> >>> Thanks for the KIP John, and sorry for the late comments.
> >>>
> >>> I'm on the fence with providing a single level metrics, but I think
> we'll
> >>> have that discussion outside of this KIP.
> >>>
> >>>> * maintain one skipped-record metric (could be per-thread, per-task,
> or
> >>>> per-processor-node) with no "reason"
> >>>> * introduce a warn-level log detailing the topic/partition/offset and
> >>>> reason of the skipped record
> >>>
> >>> I'm +1 on both of these suggestions.
> >>>
> >>> Finally, we have had requests in the past for some metrics around when
> >>> persistent store removes an expired window.  Would adding that to our
> >>> metrics stretch the scope of this KIP too much?
> >>>
> >>> Thanks again and overall I'm +1 on this KIP
> >>>
> >>> Bill
> >>>
> >>> On Fri, Mar 30, 2018 at 2:00 PM, Guozhang Wang <wa...@gmail.com>
> >>> wrote:
> >>>
> >>>> The proposal sounds good to me. About "maintain only one level of
> >>> metrics"
> >>>> maybe we can discuss about that separately from this KIP since that
> >>> would
> >>>> be a larger scope of discussion. I agree that if we are going to
> >>> maintain
> >>>> only one-level metrics it should be lowest level and we would let
> users
> >>> to
> >>>> do the roll-ups themselves, but I'm still not fully convinced that we
> >>>> should just provide single-level metrics, because 1) I think for
> >>> different
> >>>> metrics people may be interested to investigate into different
> >>>> granularities, e.g. for poll / commit rate these are at the lowest
> >>>> task-level metrics, while for process-rate / skip-rate they can be as
> >>> low
> >>>> as processor-node metrics, and 2) user-side rolling ups may not be
> very
> >>>> straight-forward. But for 2) if someone can provide an efficient and
> >>> easy
> >>>> implementation of that I can be persuaded :)
> >>>>
> >>>> For now I'm thinking we can add the metric on thread-level, either
> with
> >>>> finer grained ones with "reason" tag plus an aggregated one without
> the
> >>>> tag, or just having a single aggregated metric without the tag looks
> >>> good
> >>>> to me.
> >>>>
> >>>>
> >>>> Guozhang
> >>>>
> >>>> On Fri, Mar 30, 2018 at 8:05 AM, John Roesler <jo...@confluent.io>
> >>> wrote:
> >>>>
> >>>>> Hey Guozhang,
> >>>>>
> >>>>> Thanks for the reply. Regarding JMX, I can dig it. I'll provide a
> >>> list in
> >>>>> the KIP. I was also thinking we'd better start a documentation page
> >>> with
> >>>>> the metrics listed.
> >>>>>
> >>>>> I'd have no problem logging a warning when we skip records. On the
> >>> metric
> >>>>> front, really I'm just pushing for us to maintain only one level of
> >>>>> metrics. If that's more or less granular (i.e., maybe we don't have a
> >>>>> metric per reason and log the reason instead), that's fine by me. I
> >>> just
> >>>>> don't think it provides a lot of extra value per complexity
> (interface
> >>>> and
> >>>>> implementation) to maintain roll-ups at the thread level in addition
> >>> to
> >>>>> lower-level metrics.
> >>>>>
> >>>>> How about this instead:
> >>>>> * maintain one skipped-record metric (could be per-thread, per-task,
> >>> or
> >>>>> per-processor-node) with no "reason"
> >>>>> * introduce a warn-level log detailing the topic/partition/offset and
> >>>>> reason of the skipped record
> >>>>>
> >>>>> If you like that, I can update the KIP.
> >>>>>
> >>>>> Thanks,
> >>>>> -John
> >>>>>
> >>>>>
> >>>>>
> >>>>> On Thu, Mar 29, 2018 at 6:22 PM, Guozhang Wang <wa...@gmail.com>
> >>>> wrote:
> >>>>>
> >>>>>>> One thing you mention is the notion of setting alerts on coarser
> >>>>> metrics
> >>>>>> being easier than finer ones. All the metric alerting systems I have
> >>>> used
> >>>>>> make it equally easy to alert on metrics by-tag or over tags. So my
> >>>>>> experience doesn't say that this is a use case. Were you thinking
> >>> of an
> >>>>>> alerting system that makes such a pre-aggregation valuable?
> >>>>>>
> >>>>>> For the commonly used JMX reporter tags will be encoded directly as
> >>>> part
> >>>>> of
> >>>>>> the object name, and if users wants to monitor them they need to
> >>> know
> >>>>> these
> >>>>>> values before hand. That is also why I think we do want to list all
> >>> the
> >>>>>> possible values of the reason tags in the KIP, since
> >>>>>>
> >>>>>>> In my email in response to Matthias, I gave an example of the
> >>> kind of
> >>>>>> scenario that would lead me as an operator to run with DEBUG on all
> >>> the
> >>>>>> time, since I wouldn't be sure, having seen a skipped record once,
> >>> that
> >>>>> it
> >>>>>> would ever happen again. The solution is to capture all the
> >>> available
> >>>>>> information about the reason and location of skips all the time.
> >>>>>>
> >>>>>> That is a good point. I think we can either expose all levels
> >>> metrics
> >>>> as
> >>>>> by
> >>>>>> default, or only expose the most lower-level metrics and get rid of
> >>>> other
> >>>>>> levels to let users do roll-ups themselves (which will be a much
> >>> larger
> >>>>>> scope for discussion), or we can encourage users to not purely
> >>> depend
> >>>> on
> >>>>>> metrics for such trouble shooting: that is to say, users only be
> >>>> alerted
> >>>>>> based on metrics, and we can log a info / warn log4j entry each
> >>> time we
> >>>>> are
> >>>>>> about to skip a record all over the places, so that upon being
> >>> notified
> >>>>>> users can look into the logs to find the details on where / when it
> >>>>>> happens. WDYT?
> >>>>>>
> >>>>>>
> >>>>>> Guozhang
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On Thu, Mar 29, 2018 at 3:57 PM, John Roesler <jo...@confluent.io>
> >>>> wrote:
> >>>>>>
> >>>>>>> Hey Guozhang,
> >>>>>>>
> >>>>>>> Thanks for the review.
> >>>>>>>
> >>>>>>> 1.
> >>>>>>> Matthias raised the same question about the "reason" tag values. I
> >>>> can
> >>>>>> list
> >>>>>>> all possible values of the "reason" tag, but I'm thinking this
> >>> level
> >>>> of
> >>>>>>> detail may not be KIP-worthy, maybe the code and documentation
> >>> review
> >>>>>> would
> >>>>>>> be sufficient. If you all disagree and would like it included in
> >>> the
> >>>>>> KIP, I
> >>>>>>> can certainly do that.
> >>>>>>>
> >>>>>>> If we do provide roll-up metrics, I agree with the pattern of
> >>> keeping
> >>>>> the
> >>>>>>> same name but eliminating the tags for the dimensions that were
> >>>>>> rolled-up.
> >>>>>>>
> >>>>>>> 2.
> >>>>>>> I'm not too sure that implementation efficiency really becomes a
> >>>> factor
> >>>>>> in
> >>>>>>> choosing whether to (by default) update one coarse metric at the
> >>>> thread
> >>>>>>> level or one granular metric at the processor-node level, since
> >>> it's
> >>>>> just
> >>>>>>> one metric being updated either way. I do agree that if we were to
> >>>>> update
> >>>>>>> the granular metrics and multiple roll-ups, then we should
> >>> consider
> >>>> the
> >>>>>>> efficiency.
> >>>>>>>
> >>>>>>> I agree it's probably not necessary to surface the metrics for all
> >>>>> nodes
> >>>>>>> regardless of whether they can or do skip records. Perhaps we can
> >>>>> lazily
> >>>>>>> register the metrics.
> >>>>>>>
> >>>>>>> In my email in response to Matthias, I gave an example of the
> >>> kind of
> >>>>>>> scenario that would lead me as an operator to run with DEBUG on
> >>> all
> >>>> the
> >>>>>>> time, since I wouldn't be sure, having seen a skipped record once,
> >>>> that
> >>>>>> it
> >>>>>>> would ever happen again. The solution is to capture all the
> >>> available
> >>>>>>> information about the reason and location of skips all the time.
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> One thing you mention is the notion of setting alerts on coarser
> >>>>> metrics
> >>>>>>> being easier than finer ones. All the metric alerting systems I
> >>> have
> >>>>> used
> >>>>>>> make it equally easy to alert on metrics by-tag or over tags. So
> >>> my
> >>>>>>> experience doesn't say that this is a use case. Were you thinking
> >>> of
> >>>> an
> >>>>>>> alerting system that makes such a pre-aggregation valuable?
> >>>>>>>
> >>>>>>> Thanks again,
> >>>>>>> -John
> >>>>>>>
> >>>>>>>
> >>>>>>> On Thu, Mar 29, 2018 at 5:24 PM, Guozhang Wang <
> >>> wangguoz@gmail.com>
> >>>>>> wrote:
> >>>>>>>
> >>>>>>>> Hello John,
> >>>>>>>>
> >>>>>>>> Thanks for the KIP. Some comments:
> >>>>>>>>
> >>>>>>>> 1. Could you list all the possible values of the "reason" tag?
> >>> In
> >>>> the
> >>>>>>> JIRA
> >>>>>>>> ticket I left some potential reasons but I'm not clear if you're
> >>>>> going
> >>>>>> to
> >>>>>>>> categorize each of them as a separate reason, or is there any
> >>>>>> additional
> >>>>>>>> ones you have in mind.
> >>>>>>>>
> >>>>>>>> Also I'm wondering if we should add another metric that do not
> >>> have
> >>>>> the
> >>>>>>>> reason tag but aggregates among all possible reasons? This is
> >>> for
> >>>>> users
> >>>>>>> to
> >>>>>>>> easily set their alerting notifications (otherwise they have to
> >>>> write
> >>>>>> on
> >>>>>>>> notification rule per reason) in their monitoring systems.
> >>>>>>>>
> >>>>>>>> 2. Note that the processor-node metrics is actually "per-thread,
> >>>>>>> per-task,
> >>>>>>>> per-processor-node", and today we only set the per-thread
> >>> metrics
> >>>> as
> >>>>>> INFO
> >>>>>>>> while leaving the lower two layers as DEBUG. I agree with your
> >>>>> argument
> >>>>>>>> that we are missing the per-client roll-up metrics today, but
> >>> I'm
> >>>>>>> convinced
> >>>>>>>> that the right way to approach it would be
> >>>>> "just-providing-the-lowest-
> >>>>>>>> level
> >>>>>>>> metrics only".
> >>>>>>>>
> >>>>>>>> Note the recoding implementation of these three levels are
> >>>> different
> >>>>>>>> internally today: we did not just do the rolling up to generate
> >>> the
> >>>>>>>> higher-level metrics from the lower level ones, but we just
> >>> record
> >>>>> them
> >>>>>>>> separately, which means that, if we turn on multiple levels of
> >>>>> metrics,
> >>>>>>> we
> >>>>>>>> maybe duplicate collecting some metrics. One can argue that is
> >>> not
> >>>>> the
> >>>>>>> best
> >>>>>>>> way to represent multi-level metrics collecting and reporting,
> >>> but
> >>>> by
> >>>>>>> only
> >>>>>>>> enabling thread-level metrics as INFO today, that implementation
> >>>>> could
> >>>>>> be
> >>>>>>>> more efficient than only collecting the metrics at the lowest
> >>>> level,
> >>>>>> and
> >>>>>>>> then do the roll-up calculations outside of the metrics classes.
> >>>>>>>>
> >>>>>>>> Plus, today not all processor-nodes may possibly skip records,
> >>>> AFAIK
> >>>>> we
> >>>>>>>> will only skip records at the source, sink, window and
> >>> aggregation
> >>>>>>>> processor nodes, so adding a metric per processor looks like an
> >>>>>> overkill
> >>>>>>> to
> >>>>>>>> me as well. On the other hand, from user's perspective the
> >>> "reason"
> >>>>> tag
> >>>>>>> may
> >>>>>>>> be sufficient for them to narrow down where inside the topology
> >>> is
> >>>>>>> causing
> >>>>>>>> records to be dropped on the floor. So I think the "per-thread,
> >>>>>> per-task"
> >>>>>>>> level metrics should be sufficient for them in trouble shoot in
> >>>> DEBUG
> >>>>>>> mode,
> >>>>>>>> and we can add another "per-thread" level metrics as INFO which
> >>> is
> >>>>>> turned
> >>>>>>>> on by default. So under normal execution users still only need
> >>> INFO
> >>>>>> level
> >>>>>>>> metrics for alerting (e.g. set alerts on all skipped-records
> >>>> metrics
> >>>>> as
> >>>>>>>> non-zero), and then upon trouble shooting they can turn on DEBUG
> >>>>>> metrics
> >>>>>>> to
> >>>>>>>> look into which task is actually causing the skipped records.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Guozhang
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Thu, Mar 29, 2018 at 2:03 PM, Matthias J. Sax <
> >>>>>> matthias@confluent.io>
> >>>>>>>> wrote:
> >>>>>>>>
> >>>>>>>>> Thanks for the KIP John.
> >>>>>>>>>
> >>>>>>>>> Reading the material on the related Jiras, I am wondering what
> >>>>>> `reason`
> >>>>>>>>> tags you want to introduce? Can you elaborate? The KIP should
> >>>> list
> >>>>>>> those
> >>>>>>>>> IMHO.
> >>>>>>>>>
> >>>>>>>>> About the fine grained metrics vs the roll-up: you say that
> >>>>>>>>>
> >>>>>>>>>> the coarse metric aggregates across two dimensions
> >>>> simultaneously
> >>>>>>>>>
> >>>>>>>>> Can you elaborate why this is an issue? I am not convinced atm
> >>>> that
> >>>>>> we
> >>>>>>>>> should put the fine grained metrics into INFO level and remove
> >>>> the
> >>>>>>>>> roll-up at thread level.
> >>>>>>>>>
> >>>>>>>>>> Given that they have to do this sum to get a usable
> >>> top-level
> >>>>> view
> >>>>>>>>>
> >>>>>>>>> This is a fair concern, but I don't share the conclusion.
> >>>> Offering
> >>>>> a
> >>>>>>>>> built-in `KafkaStreams` "client" roll-up out of the box might
> >>> be
> >>>> a
> >>>>>>>>> better solution. In the past we did not offer this due to
> >>>>> performance
> >>>>>>>>> concerns, but we could allow an "opt-in" mechanism. If you
> >>>>> disagree,
> >>>>>>> can
> >>>>>>>>> you provide some reasoning and add them to the "Rejected
> >>>>>> alternatives"
> >>>>>>>>> section.
> >>>>>>>>>
> >>>>>>>>> To rephrase: I understand the issue about missing top-level
> >>> view,
> >>>>> but
> >>>>>>>>> instead of going more fine grained, we should consider to add
> >>>> this
> >>>>>>>>> top-level view and add/keep the fine grained metrics at DEBUG
> >>>> level
> >>>>>>> only
> >>>>>>>>>
> >>>>>>>>> I am +1 to add TopologyTestDriver#metrics() and to remove old
> >>>>> metrics
> >>>>>>>>> directly as you suggested.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> -Matthias
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On 3/28/18 6:42 PM, Ted Yu wrote:
> >>>>>>>>>> Looks good to me.
> >>>>>>>>>>
> >>>>>>>>>> On Wed, Mar 28, 2018 at 3:11 PM, John Roesler <
> >>>> john@confluent.io
> >>>>>>
> >>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Hello all,
> >>>>>>>>>>>
> >>>>>>>>>>> I am proposing KIP-274 to improve the metrics around
> >>> skipped
> >>>>>> records
> >>>>>>>> in
> >>>>>>>>>>> Streams.
> >>>>>>>>>>>
> >>>>>>>>>>> Please find the details here:
> >>>>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> >>>>>>>>>>> 274%3A+Kafka+Streams+Skipped+Records+Metrics
> >>>>>>>>>>>
> >>>>>>>>>>> Please let me know what you think!
> >>>>>>>>>>>
> >>>>>>>>>>> Thanks,
> >>>>>>>>>>> -John
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> --
> >>>>>>>> -- Guozhang
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> --
> >>>>>> -- Guozhang
> >>>>>>
> >>>>>
> >>>>
> >>>>
> >>>>
> >>>> --
> >>>> -- Guozhang
> >>>>
> >>>
> >>
> >>
> >
>
>