Posted to users@kafka.apache.org by chintan mavawala <ch...@gmail.com> on 2019/11/21 09:27:46 UTC

Unique users per calendar month using kafka streams

Hi,

We have a use case to capture the number of unique users per month. We
planned to use the windowing concept for this.

For example, we group events from the input topic by user name and later
sub-group them by time window. However, I don't see how I can sub-group the
results by a particular calendar month, say January; the only option I see
is sub-grouping by a fixed time span.

Any pointers would be appreciated.

Regards,
Chintan

Re: Unique users per calendar month using kafka streams

Posted by "Matthias J. Sax" <ma...@confluent.io>.
While Kafka Streams does not support monthly windows out of the box, it
is possible to define your own custom windows.

You can find an example that defines "daily windows", including timezone
support, on GitHub:
https://github.com/confluentinc/kafka-streams-examples/blob/5.3.1-post/src/test/java/io/confluent/examples/streams/window/DailyTimeWindows.java

This should help you define a custom window based on calendar months.
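
For illustration only, here is a minimal sketch of what such a monthly
window could look like, modeled on the DailyTimeWindows example above. The
class name MonthlyTimeWindows is a placeholder, and depending on your Kafka
Streams version you may also need to override deprecated methods such as
maintainMs()/until():

import java.time.Duration;
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZonedDateTime;
import java.util.Collections;
import java.util.Map;
import org.apache.kafka.streams.kstream.Windows;
import org.apache.kafka.streams.kstream.internals.TimeWindow;

public class MonthlyTimeWindows extends Windows<TimeWindow> {

    private final ZoneId zoneId;
    private final long graceMs;

    public MonthlyTimeWindows(final ZoneId zoneId, final Duration grace) {
        this.zoneId = zoneId;
        this.graceMs = grace.toMillis();
    }

    @Override
    public Map<Long, TimeWindow> windowsFor(final long timestamp) {
        // Align the window to the first day of the month in the given time zone.
        final ZonedDateTime startOfMonth = Instant.ofEpochMilli(timestamp)
            .atZone(zoneId)
            .toLocalDate()
            .withDayOfMonth(1)
            .atStartOfDay(zoneId);
        final long start = startOfMonth.toInstant().toEpochMilli();
        final long end = startOfMonth.plusMonths(1).toInstant().toEpochMilli();
        return Collections.singletonMap(start, new TimeWindow(start, end));
    }

    @Override
    public long size() {
        // Months vary in length; use 31 days as an upper bound.
        return Duration.ofDays(31).toMillis();
    }

    @Override
    public long gracePeriodMs() {
        return graceMs;
    }
}

You would then use it like any other Windows instance, e.g.
stream.groupByKey().windowedBy(new MonthlyTimeWindows(ZoneId.of("UTC"),
Duration.ofDays(2))).count().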

Hope this helps.


-Matthias

On 11/21/19 3:51 AM, claude.warren@wipro.com.INVALID wrote:
> A different approach would be to integrate Apache DataSketches (https://datasketches.apache.org/), which have mathematical proofs behind them. Using a DataSketch you can capture the unique members for any given time period in a very small data object and still be able to aggregate them (even though unique counts are not in and of themselves aggregatable). For example, you could take the monthly measures and calculate the unique users per quarter or for the entire year very quickly, generally orders of magnitude faster than recomputing from the raw data.
> 
> ________________________________
> From: Bruno Cadonna <br...@confluent.io>
> Sent: Thursday, November 21, 2019 11:37
> To: Users <us...@kafka.apache.org>
> Subject: Re: Unique users per calendar month using kafka streams
> 
> Hi Chintan,
> 
> You cannot specify time windows based on a calendar object like months.
> 
> In the following, I suppose the keys of your records are user IDs. You
> could extract the month from the timestamp of each event and add it to
> the key of the record. Then you can group the records by key and count
> them. Be aware that the state that stores the counts will grow
> indefinitely, so you need to take care of removing counts you no longer
> need from your local state.
> 
> Take a look at the following example of how to deduplicate records
> 
> https://github.com/confluentinc/kafka-streams-examples/blob/5.3.1-post/src/test/java/io/confluent/examples/streams/EventDeduplicationLambdaIntegrationTest.java
> 
> It shows how to avoid indefinite growth of the local store in such cases.
> Try to adapt it to your problem by extending the key with the month and
> computing a count instead of looking for duplicates.
> 
> Best,
> Bruno
> 
> On Thu, Nov 21, 2019 at 10:28 AM chintan mavawala
> <ch...@gmail.com> wrote:
>>
>> Hi,
>>
>> We have a use case to capture the number of unique users per month. We
>> planned to use the windowing concept for this.
>>
>> For example, we group events from the input topic by user name and later
>> sub-group them by time window. However, I don't see how I can sub-group the
>> results by a particular calendar month, say January; the only option I see
>> is sub-grouping by a fixed time span.
>>
>> Any pointers would be appreciated.
>>
>> Regards,
>> Chintan


Re: Unique users per calendar month using kafka streams

Posted by "claude.warren@wipro.com.INVALID" <cl...@wipro.com.INVALID>.
A different approach would be to integrate Apache DataSketches (https://datasketches.apache.org/), which have mathematical proofs behind them. Using a DataSketch you can capture the unique members for any given time period in a very small data object and still be able to aggregate them (even though unique counts are not in and of themselves aggregatable). For example, you could take the monthly measures and calculate the unique users per quarter or for the entire year very quickly, generally orders of magnitude faster than recomputing from the raw data.
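
For illustration, a rough sketch with the HLL sketch from datasketches-java
(package org.apache.datasketches.hll in current releases). The user IDs and
the lgK value of 12 are just examples; in a real setup each monthly sketch
would be fed from the event stream:

import org.apache.datasketches.hll.HllSketch;
import org.apache.datasketches.hll.Union;

public class UniqueUserSketches {

    public static void main(final String[] args) {
        // One HLL sketch per month; each update adds a (possibly repeated) user ID.
        final HllSketch january = new HllSketch(12);
        final HllSketch february = new HllSketch(12);
        january.update("alice");
        january.update("bob");
        january.update("alice");   // duplicates do not inflate the estimate
        february.update("alice");
        february.update("carol");

        System.out.println("unique users in January:  " + Math.round(january.getEstimate()));
        System.out.println("unique users in February: " + Math.round(february.getEstimate()));

        // Monthly sketches can be merged to estimate distinct users over a
        // longer period, something plain monthly counts cannot provide.
        final Union quarter = new Union(12);
        quarter.update(january);
        quarter.update(february);
        System.out.println("unique users so far in Q1: " + Math.round(quarter.getResult().getEstimate()));
    }
}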

________________________________
From: Bruno Cadonna <br...@confluent.io>
Sent: Thursday, November 21, 2019 11:37
To: Users <us...@kafka.apache.org>
Subject: Re: Unique users per calendar month using kafka streams

Hi Chintan,

You cannot specify time windows based on a calendar object like months.

In the following, I suppose the keys of your records are user IDs. You
could extract the month from the timestamp of each event and add it to
the key of the record. Then you can group the records by key and count
them. Be aware that the state that stores the counts will grow
indefinitely, so you need to take care of removing counts you no longer
need from your local state.

Take a look at the following example of how to deduplicate records

https://github.com/confluentinc/kafka-streams-examples/blob/5.3.1-post/src/test/java/io/confluent/examples/streams/EventDeduplicationLambdaIntegrationTest.java

It shows how to avoid indefinite growth of the local store in such cases.
Try to adapt it to your problem by extending the key with the month and
computing a count instead of looking for duplicates.

Best,
Bruno

On Thu, Nov 21, 2019 at 10:28 AM chintan mavawala
<ch...@gmail.com> wrote:
>
> Hi,
>
> We have a use case to capture the number of unique users per month. We
> planned to use the windowing concept for this.
>
> For example, we group events from the input topic by user name and later
> sub-group them by time window. However, I don't see how I can sub-group the
> results by a particular calendar month, say January; the only option I see
> is sub-grouping by a fixed time span.
>
> Any pointers would be appreciated.
>
> Regards,
> Chintan

Re: Unique users per calendar month using kafka streams

Posted by Bruno Cadonna <br...@confluent.io>.
Hi Chintan,

You cannot specify time windows based on a calendar object like months.

In the following, I suppose the keys of your records are user IDs. You
could extract the month from the timestamp of each event and add it to
the key of the record. Then you can group the records by key and count
them. Be aware that the state that stores the counts will grow
indefinitely, so you need to take care of removing counts you no longer
need from your local state.

Take a look at the following example of how to deduplicate records

https://github.com/confluentinc/kafka-streams-examples/blob/5.3.1-post/src/test/java/io/confluent/examples/streams/EventDeduplicationLambdaIntegrationTest.java

It shows how to avoid indefinite growth of the local store in such cases.
Try to adapt it to your problem by extending the key with the month and
computing a count instead of looking for duplicates.
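
For illustration only, a rough sketch of this idea in the Streams DSL could
look like the following. The topic names, and the assumption that the record
key is the user ID and the value is the event timestamp in epoch
milliseconds, are placeholders; the (month, user) table still needs the kind
of expiry the deduplication example shows:

import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;

public class UniqueUsersPerMonthTopology {

    private static final DateTimeFormatter MONTH =
        DateTimeFormatter.ofPattern("yyyy-MM").withZone(ZoneOffset.UTC);

    public static void build(final StreamsBuilder builder) {
        // Assumption: key = user ID, value = event timestamp in epoch millis.
        final KStream<String, String> events =
            builder.stream("events", Consumed.with(Serdes.String(), Serdes.String()));

        final KTable<String, Long> uniqueUsersPerMonth = events
            // Re-key to "<month>|<userId>" so the same user in the same month
            // always maps to the same key.
            .map((userId, epochMillis) -> {
                final String month = MONTH.format(Instant.ofEpochMilli(Long.parseLong(epochMillis)));
                return KeyValue.pair(month + "|" + userId, month);
            })
            .groupByKey(Grouped.with(Serdes.String(), Serdes.String()))
            // Collapse duplicates: one row per (month, user). As noted above,
            // this table grows without bound unless old months are expired.
            .reduce((oldMonth, newMonth) -> newMonth)
            // Count the distinct (month, user) rows per month.
            .groupBy((monthAndUser, month) -> KeyValue.pair(month, monthAndUser),
                     Grouped.with(Serdes.String(), Serdes.String()))
            .count();

        uniqueUsersPerMonth
            .toStream()
            .to("unique-users-per-month", Produced.with(Serdes.String(), Serdes.Long()));
    }
}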

Best,
Bruno

On Thu, Nov 21, 2019 at 10:28 AM chintan mavawala
<ch...@gmail.com> wrote:
>
> Hi,
>
> We have a use case to capture the number of unique users per month. We
> planned to use the windowing concept for this.
>
> For example, we group events from the input topic by user name and later
> sub-group them by time window. However, I don't see how I can sub-group the
> results by a particular calendar month, say January; the only option I see
> is sub-grouping by a fixed time span.
>
> Any pointers would be appreciated.
>
> Regards,
> Chintan