Posted to users@kafka.apache.org by Sachin Mittal <sj...@gmail.com> on 2017/03/02 11:54:09 UTC

Need some help in identifying some important metrics to monitor for streams

Hello All,
I have a few questions regarding monitoring of a Kafka Streams application,
and which are the important metrics we should collect in our case.

Just a brief overview: we have a single-threaded application (0.10.1.1)
reading from a single-partition topic, and it is working fine.
We also have the same application (using 0.10.2.0) running multi-threaded,
with 4 threads per machine in a 3-machine cluster, reading the same data from
a partitioned topic (12 partitions).
Thus each thread processes a single partition, the same case as the earlier
setup.

The new setup also works fine in steady state, but under load it somehow
triggers frequent rebalances, and then we run into all sorts of issues, like
stream threads dying due to CommitFailedException or entering a deadlocked
state.
After a while we restart all the instances; it then works fine for a while,
and then we get the same problem again, and so it goes on.

1. Just for monitoring: when the first thread fails, what would be some
important metrics we should be collecting to get some sense of what is going
on?

2. Is there any metric that reports the time elapsed between successive poll
requests, so we can monitor that?
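
(Regarding question 1 above, and not from the thread itself: one cheap
complement to metrics is to register an uncaught-exception handler on the
KafkaStreams instance, so a dying stream thread is at least logged or alerted
on. A minimal sketch, with purely illustrative logging; call it before
streams.start():

import org.apache.kafka.streams.KafkaStreams;

public class ThreadDeathLogger {
    // Minimal sketch: make a dying StreamThread visible in logs/alerts.
    public static void attach(KafkaStreams streams) {
        streams.setUncaughtExceptionHandler(new Thread.UncaughtExceptionHandler() {
            @Override
            public void uncaughtException(Thread thread, Throwable throwable) {
                // e.g. a CommitFailedException after a long rebalance would surface here
                System.err.println("Stream thread " + thread.getName() + " died: " + throwable);
            }
        });
    }
})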

Also, I did monitor the RocksDB put and fetch times for these 2 instances,
and here is the output I get:
0.10.1.1
$>get -s -b kafka.streams:type=stream-rocksdb-window-metrics,client-id=new-advice-1-StreamThread-1 key-table-put-avg-latency-ms
#mbean = kafka.streams:type=stream-rocksdb-window-metrics,client-id=new-advice-1-StreamThread-1:
206431.7497615029
$>get -s -b kafka.streams:type=stream-rocksdb-window-metrics,client-id=new-advice-1-StreamThread-1 key-table-fetch-avg-latency-ms
#mbean = kafka.streams:type=stream-rocksdb-window-metrics,client-id=new-advice-1-StreamThread-1:
2595394.2746129474
$>get -s -b kafka.streams:type=stream-rocksdb-window-metrics,client-id=new-advice-1-StreamThread-1 key-table-put-qps
#mbean = kafka.streams:type=stream-rocksdb-window-metrics,client-id=new-advice-1-StreamThread-1:
232.86299499317252
$>get -s -b kafka.streams:type=stream-rocksdb-window-metrics,client-id=new-advice-1-StreamThread-1 key-table-fetch-qps
#mbean = kafka.streams:type=stream-rocksdb-window-metrics,client-id=new-advice-1-StreamThread-1:
373.61071016166284

The same values I get for 0.10.2.0:
$>get -s -b kafka.streams:type=stream-rocksdb-window-metrics,client-id=new-part-advice-d1094e71-0f59-45e8-98f4-477f9444aa91-StreamThread-1 key-table-put-latency-avg
#mbean = kafka.streams:type=stream-rocksdb-window-metrics,client-id=new-part-advice-d1094e71-0f59-45e8-98f4-477f9444aa91-StreamThread-1:
1199859.5535022356
$>get -s -b kafka.streams:type=stream-rocksdb-window-metrics,client-id=new-part-advice-d1094e71-0f59-45e8-98f4-477f9444aa91-StreamThread-1 key-table-fetch-latency-avg
#mbean = kafka.streams:type=stream-rocksdb-window-metrics,client-id=new-part-advice-d1094e71-0f59-45e8-98f4-477f9444aa91-StreamThread-1:
3679340.80748852
$>get -s -b kafka.streams:type=stream-rocksdb-window-metrics,client-id=new-part-advice-d1094e71-0f59-45e8-98f4-477f9444aa91-StreamThread-1 key-table-put-rate
#mbean = kafka.streams:type=stream-rocksdb-window-metrics,client-id=new-part-advice-d1094e71-0f59-45e8-98f4-477f9444aa91-StreamThread-1:
56.134778706069184
$>get -s -b kafka.streams:type=stream-rocksdb-window-metrics,client-id=new-part-advice-d1094e71-0f59-45e8-98f4-477f9444aa91-StreamThread-1 key-table-fetch-rate
#mbean = kafka.streams:type=stream-rocksdb-window-metrics,client-id=new-part-advice-d1094e71-0f59-45e8-98f4-477f9444aa91-StreamThread-1:
136.10721427931827

I notice that the results for 0.10.2.0 are much worse than those for 0.10.1.1.

I would like to know:
1. Is there any benchmark for RocksDB indicating at what rate/latency it
should be doing put/fetch operations?

2. What could be the cause of the inferior numbers in 0.10.2.0? Is it because
this application is also running three other threads doing the same thing?

3. Also, what is with the name new-part-advice-d1094e71-
0f59-45e8-98f4-477f9444aa91-StreamThread-1?
    I wanted to use this as part of my cron job, so why can't we have a
simpler name like we have in 0.10.1.1, so it is easy to write the script?

Thanks
Sachin
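
(For reference, the JMX listings above can also be collected programmatically
instead of through an interactive JMX console. A minimal sketch; the MBean and
attribute names are copied from the 0.10.1.1 listing above, while the JMX
service URL is a placeholder and assumes remote JMX is enabled on the Streams
JVM:

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class RocksDbMetricReader {
    public static void main(String[] args) throws Exception {
        // Placeholder JMX URL; assumes the Streams JVM exposes remote JMX on port 9999
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbsc = connector.getMBeanServerConnection();
            // MBean name exactly as reported in the listing above
            ObjectName name = new ObjectName(
                    "kafka.streams:type=stream-rocksdb-window-metrics,"
                    + "client-id=new-advice-1-StreamThread-1");
            Object putLatency = mbsc.getAttribute(name, "key-table-put-avg-latency-ms");
            System.out.println("key-table-put-avg-latency-ms = " + putLatency);
        }
    }
}

The same pattern, with the 0.10.2.0 client-id and the *-latency-avg / *-rate
attribute names, works for the second instance.)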

Re: Need some help in identifying some important metrics to monitor for streams

Posted by Eno Thereska <en...@gmail.com>.
Answer inline:

> 
> I just wanted to understand: if a single poll request fetches n records, do
> the above values indicate the time computed for all n records, or just for a
> single record?
> 

In 0.10.2, the process latency is that of a single record, not the sum of n records. The commit latency is the latency for several requests. So your second statement is true:

> or is it the total average time to process these records = n * process
> latency + commit latency, before making another poll request?

Correct.

Thanks
Eno
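
(As a rough way to put numbers on this: the thread-level latency averages are
also available in-process via KafkaStreams#metrics(), so one could estimate a
poll cycle as n * process + commit + poll + punctuate and compare it with
max.poll.interval.ms. A minimal sketch; the metric names are assumed to match
the thread metrics from the monitoring docs (e.g. commit-latency-avg),
Metric#value() is the 0.10.x accessor, and recordsPerPoll is an estimate you
would have to supply yourself:

import java.util.Map;
import org.apache.kafka.common.Metric;
import org.apache.kafka.common.MetricName;
import org.apache.kafka.streams.KafkaStreams;

public class PollCycleEstimator {
    // Rough poll-cycle estimate in ms: n * process + commit + poll + punctuate (avg latencies).
    public static double estimatePollCycleMs(KafkaStreams streams, int recordsPerPoll) {
        double process = metric(streams, "process-latency-avg");
        double commit = metric(streams, "commit-latency-avg");
        double poll = metric(streams, "poll-latency-avg");
        double punctuate = metric(streams, "punctuate-latency-avg");
        return recordsPerPoll * process + commit + poll + punctuate;
    }

    private static double metric(KafkaStreams streams, String name) {
        for (Map.Entry<MetricName, ? extends Metric> e : streams.metrics().entrySet()) {
            if (e.getKey().name().equals(name)) {
                return e.getValue().value(); // Metric#value() returns a double in 0.10.x
            }
        }
        return 0.0; // metric not recorded (yet)
    }
}

If the estimate gets anywhere near max.poll.interval.ms, that is a good hint
that the rebalances are being triggered by slow processing/commits.)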



Re: Need some help in identifying some important metrics to monitor for streams

Posted by Sachin Mittal <sj...@gmail.com>.
Yes, setting the client id works. Now we are able to add metrics as part of
our cron job.

One additional question I have is about
http://kafka.apache.org/documentation.html#kafka_streams_monitoring

I am monitoring the commit latency and the process latency.
The commit latency is usually around 1000 ms and the process latency is
usually around 1 ms, i.e. three orders of magnitude less than the commit
latency.

This makes sense, because in our commit phase (i.e. forEach) we do some
external db operations.

I just wanted to understand: if a single poll request fetches n records, do
the above values indicate the time computed for all n records, or just for a
single record?

Or is the total average time to process these records = n * process
latency + commit latency, before making another poll request?

Basically, we just want to know how often poll is getting called, to see how
close it is to MAX_POLL_INTERVAL_MS_CONFIG.

Thanks
Sachin



Re: Need some help in identifying some important metrics to monitor for streams

Posted by Guozhang Wang <wa...@gmail.com>.
That is right, since the client-id is used in the metric names, which need to
be distinguishable.

https://kafka.apache.org/documentation/#streamsconfigs (I think we can
improve on the explanation of the client.id config)

A common client-id could contain the machine's host and port; of course, if
you have more than one Streams instance running on the same machine that
won't work and you need to consider using more information.

Again, the client-id config is not required, and when it is not specified
Streams will use a UUID suffix to achieve uniqueness, but as you observed that
is less human-readable for monitoring.


Guozhang
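
(To make the host-based suggestion concrete, a minimal sketch of setting an
explicit client id when building the Streams app. The application id
"new-part-advice" is the one from this thread, while the bootstrap server, the
0.10.x KStreamBuilder topology and the hostname-based id scheme are just
placeholders:

import java.net.InetAddress;
import java.util.Properties;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStreamBuilder;

public class StreamsClientIdExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "new-part-advice");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        // Explicit, human-readable client id, e.g. one per machine as discussed above
        props.put(StreamsConfig.CLIENT_ID_CONFIG,
                "new-part-advice-" + InetAddress.getLocalHost().getHostName());

        KStreamBuilder builder = new KStreamBuilder(); // 0.10.x DSL entry point
        // ... define the topology here ...

        KafkaStreams streams = new KafkaStreams(builder, props);
        streams.start();
    }
}

With this, the per-thread metric names become <client.id>-StreamThread-N, which
is what makes them scriptable from a cron job.)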


Re: Need some help in identifying some important metrics to monitor for streams

Posted by Sachin Mittal <sj...@gmail.com>.
So if I am running my streams app across a cluster of different machines,
each machine should have a different client id?


Re: Need some help in identifying some important metrics to monitor for streams

Posted by Guozhang Wang <wa...@gmail.com>.
Sachin,

The reason that you got a metrics name like

new-part-advice-d1094e71-0f59-45e8-98f4-477f9444aa91-StreamThread-1


is that you did not set "CLIENT_ID_CONFIG" in your app, so KafkaStreams has
to use a default combination of the appId ("new-part-advice") and a processId
(a UUID, to guarantee uniqueness across machines) as its clientId.


As for the metrics name, it is always set as clientId + "-" + threadName,
where "StreamThread-1" is your threadName, which is unique only WITHIN the
JVM; that is why we still need the globally unique clientId to distinguish
instances.

I just checked the source code and this logic was not changed from 0.10.1
to 0.10.2, so I guess you set your clientId as "new-advice-1" as well in
0.10.1?


Guozhang



On Fri, Mar 3, 2017 at 4:02 AM, Eno Thereska <en...@gmail.com> wrote:

> Hi Sachin,
>
> Now that the confluent platform 3.2 is out, we also have some more
> documentation on this here:
> http://docs.confluent.io/3.2.0/streams/monitoring.html. We added a note on
> how to add other metrics.
>
> Yeah, your calculation on poll time makes sense. The important metrics are
> the “info” ones that are on by default. However, for stateful applications,
> if you suspect that state stores might be bottlenecking, you might want to
> collect those metrics too.
>
> On the benchmarks, the ones called “processstreamwithstatestore” and
> “count” are the closest to a benchmark of RocksDb with the default
> configs. The first writes each record to RocksDb, while the second performs
> simple aggregates (reads and writes from/to RocksDb).
>
> We might need to add more benchmarks here, would be great to get some
> ideas and help from the community. E.g., a pure RocksDb benchmark that
> doesn’t go through streams at all.
>
> Could you open a JIRA on the name issue please? As an “improvement”.
>
> Thanks
> Eno
>
>
>
> > On Mar 2, 2017, at 6:00 PM, Sachin Mittal <sj...@gmail.com> wrote:
> >
> > Hi,
> > I had checked the monitoring docs, but could not figure out which metrics
> > are the important ones.
> >
> > Also mainly I am looking at the average time spent between 2 successive
> > poll requests.
> > Can I say that average time between 2 poll requests is sum of
> >
> > commit + poll + process + punctuate (latency-avg).
> >
> >
> > Also I checked the benchmark test results but could not find any
> > information on rocksdb metrics for fetch and put operations.
> > Is there any benchmark for these or, based on my values in the previous
> > mail, can something be commented on its performance?
> >
> >
> > Lastly, can we get some help on names like new-part-advice-d1094e71-0f59-
> > 45e8-98f4-477f9444aa91-StreamThread-1 and have a more standard thread name
> > like new-advice-1-StreamThread-1 (as in version 0.10.1.1), so we can log
> > these metrics as part of our cron jobs.
> >
> > Thanks
> > Sachin
> >
> >
> >
> > On Thu, Mar 2, 2017 at 9:31 PM, Eno Thereska <en...@gmail.com>
> wrote:
> >
> >> Hi Sachin,
> >>
> >> The new streams metrics are now documented at
> >> https://kafka.apache.org/documentation/#kafka_streams_monitoring. Note
> >> that not all of them are turned on by default.
> >>
> >> We have several benchmarks that run nightly to monitor streams
> >> performance. They all stem from the SimpleBenchmark.java benchmark. In
> >> addition, their results are published nightly here
> >> http://testing.confluent.io (e.g., under the trunk results). E.g.,
> >> looking at today's results:
> >> http://confluent-kafka-system-test-results.s3-us-west-2.amazonaws.com/2017-03-02--001.1488449554--apache--trunk--ef92bb4/report.html
> >> (if you search for "benchmarks.streams") you'll see results from a series
> >> of benchmarks, ranging from simply consuming, to simple topologies with a
> >> source and sink, to joins and count aggregates. These run on AWS nightly,
> >> but you can also run them manually on your setup.
> >>
> >> In addition, programmatically the code can check the KafkaStreams.state()
> >> and register listeners for when the state changes. For example, the
> >> state can change from "running" to "rebalancing".
> >>
> >> It is likely we'll need more metrics moving forward, and it would be
> >> great to get feedback from the community.
> >>
> >>
> >> Thanks
> >> Eno
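
(The KafkaStreams.state() / listener suggestion quoted above is particularly
relevant to the rebalance problem in this thread. A minimal sketch of
registering such a listener; the logging is illustrative:

import org.apache.kafka.streams.KafkaStreams;

public class RebalanceMonitor {
    // Minimal sketch: log (or alert on) every state transition, e.g. RUNNING -> REBALANCING.
    public static void attach(KafkaStreams streams) {
        streams.setStateListener(new KafkaStreams.StateListener() {
            @Override
            public void onChange(KafkaStreams.State newState, KafkaStreams.State oldState) {
                System.out.println("Streams state changed from " + oldState + " to " + newState);
                if (newState == KafkaStreams.State.REBALANCING) {
                    // hook for alerting or incrementing a rebalance counter
                }
            }
        });
    }
}

Counting these transitions over time gives a direct measure of how frequent
the rebalances are.)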
> >>> id=new-part-advice-d1094e71-0f59-45e8-98f4-477f9444aa91-
> StreamThread-1:
> >>> 56.134778706069184
> >>> $>get -s -b kafka.streams:type=stream-rocksdb-window-metrics,client-
> >>> id=new-part-advice-d1094e71-0f59-45e8-98f4-477f9444aa91-StreamThread-1
> >>> key-table-fetch-rate
> >>> #mbean = kafka.streams:type=stream-rocksdb-window-metrics,client-
> >>> id=new-part-advice-d1094e71-0f59-45e8-98f4-477f9444aa91-
> StreamThread-1:
> >>> 136.10721427931827
> >>>
> >>> I notice that result in 10.2.0 is much worse than same for 10.1.1
> >>>
> >>> I would like to know
> >>> 1. Is there any benchmark on rocksdb as at what rate/latency it should
> be
> >>> doing put/fetch operations.
> >>>
> >>> 2. What could be the cause of inferior numbers in 10.2.0, is it because
> >>> this application is also running three other threads doing the same
> >> thing.
> >>>
> >>> 3. Also whats with the name new-part-advice-d1094e71-
> >>> 0f59-45e8-98f4-477f9444aa91-StreamThread-1
> >>>   I wanted to put this as a part of my cronjob, so why can't we have
> >>> simpler name like we have in 10.1.1, so it is easy to write the script.
> >>>
> >>> Thanks
> >>> Sachin
> >>
> >>
>
>


-- 
-- Guozhang

Re: Need some help in identifying some important metrics to monitor for streams

Posted by Eno Thereska <en...@gmail.com>.
Hi Sachin,

Now that Confluent Platform 3.2 is out, we also have some more documentation on this here: http://docs.confluent.io/3.2.0/streams/monitoring.html <http://docs.confluent.io/3.2.0/streams/monitoring.html>. We added a note on how to add other metrics.

Yeah, your calculation on poll time makes sense. The important metrics are the “info” ones that are on by default. However, for stateful applications, if you suspect that the state stores might be the bottleneck, you might want to collect those metrics too.
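In case it's useful, here is a minimal sketch of turning those on (the application id, broker list and topic below are placeholders; the relevant property is metrics.recording.level, added in 0.10.2.0 via KIP-105, which defaults to INFO):

    import java.util.Properties;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.KStreamBuilder;

    public class DebugMetricsExample {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "new-part-advice");   // placeholder
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
            // "info"-level metrics are recorded by default; the per-store sensors
            // (e.g. the rocksdb window metrics) are only recorded once the
            // recording level is raised to DEBUG.
            props.put("metrics.recording.level", "DEBUG");

            KStreamBuilder builder = new KStreamBuilder();
            builder.stream("advice-topic");                                      // placeholder topic
            KafkaStreams streams = new KafkaStreams(builder, props);
            streams.start();
        }
    }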

On the benchmarks, the ones called “processstreamwithstatestore” and “count” are the closest to benchmarking RocksDB with the default configs. The first writes each record to RocksDB, while the second performs simple aggregates (reads from and writes to RocksDB).

We might need to add more benchmarks here; it would be great to get some ideas and help from the community, e.g., a pure RocksDB benchmark that doesn’t go through Streams at all.
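To make the idea concrete, the sketch below is roughly what I mean by a pure RocksDB benchmark (it is not part of the Kafka codebase; it assumes a recent rocksdbjni artifact on the classpath and writes to a scratch directory under /tmp):

    import org.rocksdb.Options;
    import org.rocksdb.RocksDB;
    import org.rocksdb.RocksDBException;

    public class PureRocksDbBenchmark {
        public static void main(String[] args) throws RocksDBException {
            RocksDB.loadLibrary();
            try (Options options = new Options().setCreateIfMissing(true);
                 RocksDB db = RocksDB.open(options, "/tmp/pure-rocksdb-bench")) {
                final int n = 1_000_000;
                final byte[] value = new byte[100];          // 100-byte dummy values

                long start = System.nanoTime();
                for (int i = 0; i < n; i++) {
                    db.put(Integer.toString(i).getBytes(), value);
                }
                long putNanos = System.nanoTime() - start;

                start = System.nanoTime();
                for (int i = 0; i < n; i++) {
                    db.get(Integer.toString(i).getBytes());
                }
                long getNanos = System.nanoTime() - start;

                System.out.printf("put: %.1f us/op, get: %.1f us/op%n",
                        putNanos / 1000.0 / n, getNanos / 1000.0 / n);
            }
        }
    }

This measures raw put/get latency only, without Streams' serialization, caching or changelogging on top.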

Could you open a JIRA on the name issue please? As an “improvement”.

Thanks
Eno



> On Mar 2, 2017, at 6:00 PM, Sachin Mittal <sj...@gmail.com> wrote:
> 
> Hi,
> I had checked the monitoring docs, but could not figure out which metrics
> are the important ones.
>
> Mainly, I am looking at the average time spent between two successive
> poll requests.
> Can I say that the average time between two poll requests is the sum of
>
> commit + poll + process + punctuate (latency-avg)?
>
>
> I also checked the benchmark test results but could not find any
> information on RocksDB metrics for fetch and put operations.
> Is there any benchmark for these, or can something be said about the
> performance based on the values in my previous mail?
>
>
> Lastly, could we get more standard thread names like
> new-advice-1-StreamThread-1 (as in version 0.10.1.1) instead of names like
> new-part-advice-d1094e71-0f59-45e8-98f4-477f9444aa91-StreamThread-1, so we
> can log these metrics as part of our cron jobs?
> 
> Thanks
> Sachin
> 
> 
> 
> On Thu, Mar 2, 2017 at 9:31 PM, Eno Thereska <en...@gmail.com> wrote:
> 
>> Hi Sachin,
>> 
>> The new streams metrics are now documented at https://kafka.apache.org/
>> documentation/#kafka_streams_monitoring <https://kafka.apache.org/
>> documentation/#kafka_streams_monitoring>. Note that not all of them are
>> turned on by default.
>> 
>> We have several benchmarks that run nightly to monitor streams
>> performance. They all stem from the SimpleBenchmark.java benchmark. In
>> addition, their results are published nightly here
>> http://testing.confluent.io <http://testing.confluent.io/>, (e.g., under
>> the trunk results). E.g., looking at today's results:
>> http://confluent-kafka-system-test-results.s3-us-west-2.
>> amazonaws.com/2017-03-02--001.1488449554--apache--trunk--
>> ef92bb4/report.html <http://confluent-kafka-system-test-results.s3-us-
>> west-2.amazonaws.com/2017-03-02--001.1488449554--apache--
>> trunk--ef92bb4/report.html>
>> (if you search for "benchmarks.streams") you'll see results from a series
>> of benchmarks, ranging from simply consuming, to simple topologies with a
>> source and sink, to joins and count aggregate. These run on AWS nightly,
>> but you can also run manually on your setup.
>> 
>> In addition, programmatically the code can check the KafkaStreams.state()
>> and register listeners for when the state changes. For example, the state
>> can change from "running" to "rebalancing".
>> 
>> It is likely we'll need more metrics moving forward and would be great to
>> get feedback from the community.
>> 
>> 
>> Thanks
>> Eno
>> 
>> 
>> 
>> 
>>> On 2 Mar 2017, at 11:54, Sachin Mittal <sj...@gmail.com> wrote:
>>> 
>>> Hello All,
>>> I had few questions regarding monitoring of kafka streams application and
>>> what are some important metrics we should collect in our case.
>>> 
>>> Just a brief overview, we have a single thread application (0.10.1.1)
>>> reading from single partition topic and it is working all fine.
>>> Then we have same application (using 0.10.2.0) multi threaded with 4
>>> threads per machine and 3 machines cluster setup reading for same but
>>> partitioned topic (12 partitions).
>>> Thus we have each thread processing single partition same case as earlier
>>> one.
>>> 
>>> The new setup also works fine in steady state, but under load somehow it
>>> triggers frequent re-balance and then we run into all sort of issues like
>>> stream thread dying due to CommitFailedException or entering into
>> deadlock
>>> state.
>>> After a while we restart all the instances then it works fine for a while
>>> and again we get the same problem and it goes on.
>>> 
>>> 1. So just to monitor, like when first thread fails what would be some
>>> important metrics we should be collecting to get some sense of whats
>> going
>>> on?
>>> 
>>> 2. Is there any metric that tells time elapsed between successive poll
>>> requests, so we can monitor that?
>>> 
>>> Also I did monitor rocksdb put and fetch times for these 2 instances and
>>> here is the output I get:
>>> 0.10.1.1
>>> $>get -s  -b kafka.streams:type=stream-rocksdb-window-metrics,client-
>> id=new-advice-1-StreamThread-1
>>> key-table-put-avg-latency-ms
>>> #mbean = kafka.streams:type=stream-rocksdb-window-metrics,client-
>>> id=new-advice-1-StreamThread-1:
>>> 206431.7497615029
>>> $>get -s  -b kafka.streams:type=stream-rocksdb-window-metrics,client-
>> id=new-advice-1-StreamThread-1
>>> key-table-fetch-avg-latency-ms
>>> #mbean = kafka.streams:type=stream-rocksdb-window-metrics,client-
>>> id=new-advice-1-StreamThread-1:
>>> 2595394.2746129474
>>> $>get -s  -b kafka.streams:type=stream-rocksdb-window-metrics,client-
>> id=new-advice-1-StreamThread-1
>>> key-table-put-qps
>>> #mbean = kafka.streams:type=stream-rocksdb-window-metrics,client-
>>> id=new-advice-1-StreamThread-1:
>>> 232.86299499317252
>>> $>get -s  -b kafka.streams:type=stream-rocksdb-window-metrics,client-
>> id=new-advice-1-StreamThread-1
>>> key-table-fetch-qps
>>> #mbean = kafka.streams:type=stream-rocksdb-window-metrics,client-
>>> id=new-advice-1-StreamThread-1:
>>> 373.61071016166284
>>> 
>>> Same values for 0.10.2.0 I get
>>> $>get -s -b kafka.streams:type=stream-rocksdb-window-metrics,client-
>>> id=new-part-advice-d1094e71-0f59-45e8-98f4-477f9444aa91-StreamThread-1
>>> key-table-put-latency-avg
>>> #mbean = kafka.streams:type=stream-rocksdb-window-metrics,client-
>>> id=new-part-advice-d1094e71-0f59-45e8-98f4-477f9444aa91-StreamThread-1:
>>> 1199859.5535022356
>>> $>get -s -b kafka.streams:type=stream-rocksdb-window-metrics,client-
>>> id=new-part-advice-d1094e71-0f59-45e8-98f4-477f9444aa91-StreamThread-1
>>> key-table-fetch-latency-avg
>>> #mbean = kafka.streams:type=stream-rocksdb-window-metrics,client-
>>> id=new-part-advice-d1094e71-0f59-45e8-98f4-477f9444aa91-StreamThread-1:
>>> 3679340.80748852
>>> $>get -s -b kafka.streams:type=stream-rocksdb-window-metrics,client-
>>> id=new-part-advice-d1094e71-0f59-45e8-98f4-477f9444aa91-StreamThread-1
>>> key-table-put-rate
>>> #mbean = kafka.streams:type=stream-rocksdb-window-metrics,client-
>>> id=new-part-advice-d1094e71-0f59-45e8-98f4-477f9444aa91-StreamThread-1:
>>> 56.134778706069184
>>> $>get -s -b kafka.streams:type=stream-rocksdb-window-metrics,client-
>>> id=new-part-advice-d1094e71-0f59-45e8-98f4-477f9444aa91-StreamThread-1
>>> key-table-fetch-rate
>>> #mbean = kafka.streams:type=stream-rocksdb-window-metrics,client-
>>> id=new-part-advice-d1094e71-0f59-45e8-98f4-477f9444aa91-StreamThread-1:
>>> 136.10721427931827
>>> 
>>> I notice that result in 10.2.0 is much worse than same for 10.1.1
>>> 
>>> I would like to know
>>> 1. Is there any benchmark on rocksdb as at what rate/latency it should be
>>> doing put/fetch operations.
>>> 
>>> 2. What could be the cause of inferior numbers in 10.2.0, is it because
>>> this application is also running three other threads doing the same
>> thing.
>>> 
>>> 3. Also whats with the name new-part-advice-d1094e71-
>>> 0f59-45e8-98f4-477f9444aa91-StreamThread-1
>>>   I wanted to put this as a part of my cronjob, so why can't we have
>>> simpler name like we have in 10.1.1, so it is easy to write the script.
>>> 
>>> Thanks
>>> Sachin
>> 
>> 


Re: Need some help in identifying some important metrics to monitor for streams

Posted by Sachin Mittal <sj...@gmail.com>.
Hi,
I had checked the monitoring docs, but could not figure out which metrics
are the important ones.

Mainly, I am looking at the average time spent between two successive
poll requests.
Can I say that the average time between two poll requests is the sum of

commit + poll + process + punctuate (latency-avg)?
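For reference, I am thinking of pulling those numbers out of the application roughly as below (just a sketch; the group and attribute names are the thread-level ones I found for 0.10.2, so please correct me if they are wrong):

    import java.util.Map;
    import org.apache.kafka.common.Metric;
    import org.apache.kafka.common.MetricName;
    import org.apache.kafka.streams.KafkaStreams;

    public class PollGapEstimate {
        // Sums the per-thread avg latencies of commit, poll, process and punctuate.
        // With several threads per instance you would also filter on the
        // client-id tag of each MetricName.
        static double approxMillisBetweenPolls(KafkaStreams streams) {
            double sum = 0.0;
            for (Map.Entry<MetricName, ? extends Metric> e : streams.metrics().entrySet()) {
                MetricName m = e.getKey();
                if ("stream-metrics".equals(m.group())
                        && (m.name().equals("commit-latency-avg")
                            || m.name().equals("poll-latency-avg")
                            || m.name().equals("process-latency-avg")
                            || m.name().equals("punctuate-latency-avg"))) {
                    sum += e.getValue().value();
                }
            }
            return sum;
        }
    }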


I also checked the benchmark test results but could not find any
information on RocksDB metrics for fetch and put operations.
Is there any benchmark for these, or can something be said about the
performance based on the values in my previous mail?


Lastly, could we get more standard thread names like
new-advice-1-StreamThread-1 (as in version 0.10.1.1) instead of names like
new-part-advice-d1094e71-0f59-45e8-98f4-477f9444aa91-StreamThread-1, so we
can log these metrics as part of our cron jobs?
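The workaround I am considering in the meantime is matching the generated client-id with a JMX object-name pattern, roughly like this (a sketch; conn can be the in-process platform MBean server or a remote connection obtained via JMXConnectorFactory):

    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;

    public class StoreMetricsQuery {
        static void printPutLatencies(MBeanServerConnection conn) throws Exception {
            // A property-value pattern (the trailing *) matches whatever UUID
            // the generated client-id contains in this particular run.
            ObjectName pattern = new ObjectName(
                "kafka.streams:type=stream-rocksdb-window-metrics,client-id=new-part-advice*");
            for (ObjectName name : conn.queryNames(pattern, null)) {
                Object latency = conn.getAttribute(name, "key-table-put-latency-avg");
                System.out.println(name + " key-table-put-latency-avg = " + latency);
            }
        }
    }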

Thanks
Sachin



On Thu, Mar 2, 2017 at 9:31 PM, Eno Thereska <en...@gmail.com> wrote:

> Hi Sachin,
>
> The new streams metrics are now documented at https://kafka.apache.org/
> documentation/#kafka_streams_monitoring <https://kafka.apache.org/
> documentation/#kafka_streams_monitoring>. Note that not all of them are
> turned on by default.
>
> We have several benchmarks that run nightly to monitor streams
> performance. They all stem from the SimpleBenchmark.java benchmark. In
> addition, their results are published nightly here
> http://testing.confluent.io <http://testing.confluent.io/>, (e.g., under
> the trunk results). E.g., looking at today's results:
> http://confluent-kafka-system-test-results.s3-us-west-2.
> amazonaws.com/2017-03-02--001.1488449554--apache--trunk--
> ef92bb4/report.html <http://confluent-kafka-system-test-results.s3-us-
> west-2.amazonaws.com/2017-03-02--001.1488449554--apache--
> trunk--ef92bb4/report.html>
> (if you search for "benchmarks.streams") you'll see results from a series
> of benchmarks, ranging from simply consuming, to simple topologies with a
> source and sink, to joins and count aggregate. These run on AWS nightly,
> but you can also run manually on your setup.
>
> In addition, programmatically the code can check the KafkaStreams.state()
> and register listeners for when the state changes. For example, the state
> can change from "running" to "rebalancing".
>
> It is likely we'll need more metrics moving forward and would be great to
> get feedback from the community.
>
>
> Thanks
> Eno
>
>
>
>
> > On 2 Mar 2017, at 11:54, Sachin Mittal <sj...@gmail.com> wrote:
> >
> > Hello All,
> > I had few questions regarding monitoring of kafka streams application and
> > what are some important metrics we should collect in our case.
> >
> > Just a brief overview, we have a single thread application (0.10.1.1)
> > reading from single partition topic and it is working all fine.
> > Then we have same application (using 0.10.2.0) multi threaded with 4
> > threads per machine and 3 machines cluster setup reading for same but
> > partitioned topic (12 partitions).
> > Thus we have each thread processing single partition same case as earlier
> > one.
> >
> > The new setup also works fine in steady state, but under load somehow it
> > triggers frequent re-balance and then we run into all sort of issues like
> > stream thread dying due to CommitFailedException or entering into
> deadlock
> > state.
> > After a while we restart all the instances then it works fine for a while
> > and again we get the same problem and it goes on.
> >
> > 1. So just to monitor, like when first thread fails what would be some
> > important metrics we should be collecting to get some sense of whats
> going
> > on?
> >
> > 2. Is there any metric that tells time elapsed between successive poll
> > requests, so we can monitor that?
> >
> > Also I did monitor rocksdb put and fetch times for these 2 instances and
> > here is the output I get:
> > 0.10.1.1
> > $>get -s  -b kafka.streams:type=stream-rocksdb-window-metrics,client-
> id=new-advice-1-StreamThread-1
> > key-table-put-avg-latency-ms
> > #mbean = kafka.streams:type=stream-rocksdb-window-metrics,client-
> > id=new-advice-1-StreamThread-1:
> > 206431.7497615029
> > $>get -s  -b kafka.streams:type=stream-rocksdb-window-metrics,client-
> id=new-advice-1-StreamThread-1
> > key-table-fetch-avg-latency-ms
> > #mbean = kafka.streams:type=stream-rocksdb-window-metrics,client-
> > id=new-advice-1-StreamThread-1:
> > 2595394.2746129474
> > $>get -s  -b kafka.streams:type=stream-rocksdb-window-metrics,client-
> id=new-advice-1-StreamThread-1
> > key-table-put-qps
> > #mbean = kafka.streams:type=stream-rocksdb-window-metrics,client-
> > id=new-advice-1-StreamThread-1:
> > 232.86299499317252
> > $>get -s  -b kafka.streams:type=stream-rocksdb-window-metrics,client-
> id=new-advice-1-StreamThread-1
> > key-table-fetch-qps
> > #mbean = kafka.streams:type=stream-rocksdb-window-metrics,client-
> > id=new-advice-1-StreamThread-1:
> > 373.61071016166284
> >
> > Same values for 0.10.2.0 I get
> > $>get -s -b kafka.streams:type=stream-rocksdb-window-metrics,client-
> > id=new-part-advice-d1094e71-0f59-45e8-98f4-477f9444aa91-StreamThread-1
> > key-table-put-latency-avg
> > #mbean = kafka.streams:type=stream-rocksdb-window-metrics,client-
> > id=new-part-advice-d1094e71-0f59-45e8-98f4-477f9444aa91-StreamThread-1:
> > 1199859.5535022356
> > $>get -s -b kafka.streams:type=stream-rocksdb-window-metrics,client-
> > id=new-part-advice-d1094e71-0f59-45e8-98f4-477f9444aa91-StreamThread-1
> > key-table-fetch-latency-avg
> > #mbean = kafka.streams:type=stream-rocksdb-window-metrics,client-
> > id=new-part-advice-d1094e71-0f59-45e8-98f4-477f9444aa91-StreamThread-1:
> > 3679340.80748852
> > $>get -s -b kafka.streams:type=stream-rocksdb-window-metrics,client-
> > id=new-part-advice-d1094e71-0f59-45e8-98f4-477f9444aa91-StreamThread-1
> > key-table-put-rate
> > #mbean = kafka.streams:type=stream-rocksdb-window-metrics,client-
> > id=new-part-advice-d1094e71-0f59-45e8-98f4-477f9444aa91-StreamThread-1:
> > 56.134778706069184
> > $>get -s -b kafka.streams:type=stream-rocksdb-window-metrics,client-
> > id=new-part-advice-d1094e71-0f59-45e8-98f4-477f9444aa91-StreamThread-1
> > key-table-fetch-rate
> > #mbean = kafka.streams:type=stream-rocksdb-window-metrics,client-
> > id=new-part-advice-d1094e71-0f59-45e8-98f4-477f9444aa91-StreamThread-1:
> > 136.10721427931827
> >
> > I notice that result in 10.2.0 is much worse than same for 10.1.1
> >
> > I would like to know
> > 1. Is there any benchmark on rocksdb as at what rate/latency it should be
> > doing put/fetch operations.
> >
> > 2. What could be the cause of inferior numbers in 10.2.0, is it because
> > this application is also running three other threads doing the same
> thing.
> >
> > 3. Also whats with the name new-part-advice-d1094e71-
> > 0f59-45e8-98f4-477f9444aa91-StreamThread-1
> >    I wanted to put this as a part of my cronjob, so why can't we have
> > simpler name like we have in 10.1.1, so it is easy to write the script.
> >
> > Thanks
> > Sachin
>
>

Re: Need some help in identifying some important metrics to monitor for streams

Posted by Eno Thereska <en...@gmail.com>.
Hi Sachin,

The new streams metrics are now documented at https://kafka.apache.org/documentation/#kafka_streams_monitoring <https://kafka.apache.org/documentation/#kafka_streams_monitoring>. Note that not all of them are turned on by default. 

We have several benchmarks that run nightly to monitor streams performance. They all stem from the SimpleBenchmark.java benchmark. In addition, their results are published nightly at http://testing.confluent.io <http://testing.confluent.io/> (e.g., under the trunk results). For example, looking at today's results:
http://confluent-kafka-system-test-results.s3-us-west-2.amazonaws.com/2017-03-02--001.1488449554--apache--trunk--ef92bb4/report.html <http://confluent-kafka-system-test-results.s3-us-west-2.amazonaws.com/2017-03-02--001.1488449554--apache--trunk--ef92bb4/report.html>
(if you search for "benchmarks.streams") you'll see results from a series of benchmarks, ranging from simple consuming, to simple topologies with a source and sink, to joins and count aggregates. These run on AWS nightly, but you can also run them manually on your setup.

In addition, your code can programmatically check KafkaStreams.state() and register listeners for when the state changes. For example, the state can change from "running" to "rebalancing".
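A minimal sketch of registering such a listener (assuming a KStreamBuilder builder and Properties props set up as usual, and registering the listener before start()):

    final KafkaStreams streams = new KafkaStreams(builder, props);
    streams.setStateListener(new KafkaStreams.StateListener() {
        @Override
        public void onChange(final KafkaStreams.State newState,
                             final KafkaStreams.State oldState) {
            // e.g. log or alert whenever the instance starts rebalancing,
            // which is when the frequent-rebalance symptoms tend to appear
            if (newState == KafkaStreams.State.REBALANCING) {
                System.out.println("state change: " + oldState + " -> " + newState);
            }
        }
    });
    streams.start();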

It is likely we'll need more metrics moving forward, and it would be great to get feedback from the community.


Thanks
Eno




> On 2 Mar 2017, at 11:54, Sachin Mittal <sj...@gmail.com> wrote:
> 
> Hello All,
> I had few questions regarding monitoring of kafka streams application and
> what are some important metrics we should collect in our case.
> 
> Just a brief overview, we have a single thread application (0.10.1.1)
> reading from single partition topic and it is working all fine.
> Then we have same application (using 0.10.2.0) multi threaded with 4
> threads per machine and 3 machines cluster setup reading for same but
> partitioned topic (12 partitions).
> Thus we have each thread processing single partition same case as earlier
> one.
> 
> The new setup also works fine in steady state, but under load somehow it
> triggers frequent re-balance and then we run into all sort of issues like
> stream thread dying due to CommitFailedException or entering into deadlock
> state.
> After a while we restart all the instances then it works fine for a while
> and again we get the same problem and it goes on.
> 
> 1. So just to monitor, like when first thread fails what would be some
> important metrics we should be collecting to get some sense of whats going
> on?
> 
> 2. Is there any metric that tells time elapsed between successive poll
> requests, so we can monitor that?
> 
> Also I did monitor rocksdb put and fetch times for these 2 instances and
> here is the output I get:
> 0.10.1.1
> $>get -s  -b kafka.streams:type=stream-rocksdb-window-metrics,client-id=new-advice-1-StreamThread-1
> key-table-put-avg-latency-ms
> #mbean = kafka.streams:type=stream-rocksdb-window-metrics,client-
> id=new-advice-1-StreamThread-1:
> 206431.7497615029
> $>get -s  -b kafka.streams:type=stream-rocksdb-window-metrics,client-id=new-advice-1-StreamThread-1
> key-table-fetch-avg-latency-ms
> #mbean = kafka.streams:type=stream-rocksdb-window-metrics,client-
> id=new-advice-1-StreamThread-1:
> 2595394.2746129474
> $>get -s  -b kafka.streams:type=stream-rocksdb-window-metrics,client-id=new-advice-1-StreamThread-1
> key-table-put-qps
> #mbean = kafka.streams:type=stream-rocksdb-window-metrics,client-
> id=new-advice-1-StreamThread-1:
> 232.86299499317252
> $>get -s  -b kafka.streams:type=stream-rocksdb-window-metrics,client-id=new-advice-1-StreamThread-1
> key-table-fetch-qps
> #mbean = kafka.streams:type=stream-rocksdb-window-metrics,client-
> id=new-advice-1-StreamThread-1:
> 373.61071016166284
> 
> Same values for 0.10.2.0 I get
> $>get -s -b kafka.streams:type=stream-rocksdb-window-metrics,client-
> id=new-part-advice-d1094e71-0f59-45e8-98f4-477f9444aa91-StreamThread-1
> key-table-put-latency-avg
> #mbean = kafka.streams:type=stream-rocksdb-window-metrics,client-
> id=new-part-advice-d1094e71-0f59-45e8-98f4-477f9444aa91-StreamThread-1:
> 1199859.5535022356
> $>get -s -b kafka.streams:type=stream-rocksdb-window-metrics,client-
> id=new-part-advice-d1094e71-0f59-45e8-98f4-477f9444aa91-StreamThread-1
> key-table-fetch-latency-avg
> #mbean = kafka.streams:type=stream-rocksdb-window-metrics,client-
> id=new-part-advice-d1094e71-0f59-45e8-98f4-477f9444aa91-StreamThread-1:
> 3679340.80748852
> $>get -s -b kafka.streams:type=stream-rocksdb-window-metrics,client-
> id=new-part-advice-d1094e71-0f59-45e8-98f4-477f9444aa91-StreamThread-1
> key-table-put-rate
> #mbean = kafka.streams:type=stream-rocksdb-window-metrics,client-
> id=new-part-advice-d1094e71-0f59-45e8-98f4-477f9444aa91-StreamThread-1:
> 56.134778706069184
> $>get -s -b kafka.streams:type=stream-rocksdb-window-metrics,client-
> id=new-part-advice-d1094e71-0f59-45e8-98f4-477f9444aa91-StreamThread-1
> key-table-fetch-rate
> #mbean = kafka.streams:type=stream-rocksdb-window-metrics,client-
> id=new-part-advice-d1094e71-0f59-45e8-98f4-477f9444aa91-StreamThread-1:
> 136.10721427931827
> 
> I notice that result in 10.2.0 is much worse than same for 10.1.1
> 
> I would like to know
> 1. Is there any benchmark on rocksdb as at what rate/latency it should be
> doing put/fetch operations.
> 
> 2. What could be the cause of inferior numbers in 10.2.0, is it because
> this application is also running three other threads doing the same thing.
> 
> 3. Also whats with the name new-part-advice-d1094e71-
> 0f59-45e8-98f4-477f9444aa91-StreamThread-1
>    I wanted to put this as a part of my cronjob, so why can't we have
> simpler name like we have in 10.1.1, so it is easy to write the script.
> 
> Thanks
> Sachin