Posted to users@kafka.apache.org by Hamza HACHANI <ha...@supcom.tn> on 2016/07/29 14:35:44 UTC

Kafka streams Issue

> Good morning,
>
> I'm an ICT student at TELECOM BRETAGNE (a French school).
> I followed your presentations on YouTube and found them really
> interesting.
> I'm trying to do some things with Kafka, and I have now been blocked for
> about 3 days.
> I'm trying to control the time at which my processing application sends
> data to the output topic.
> What I'm trying to do is to make the application process data from the
> input topic all the time, but send the messages only at the end of a
> minute/an hour/a month ... (the notion of windowing).
> For the moment, what I managed to do is that instead of sending data only
> at the end of the minute, the application sends it whenever it receives
> data from the input topic.
> Do you have any suggestions to help me?
> I would be really grateful.


Preliminary answer for now:

> For the moment, what I managed to do is that instead of sending data only
> at the end of the minute, the application sends it whenever it receives
> data from the input topic.

This is actually the expected behavior at the moment.

The main reason for this behavior is that, in stream processing, we never
know whether there is still late-arriving data to be received.  For
example, imagine you have 1-minute windows based on event-time.  It may
happen that, after the first 1-minute window has passed, another record
arrives five minutes later that, according to its event-time, should still
have been part of that first 1-minute window.  In this case, what we
typically want is for the first 1-minute window to be updated/reprocessed
with the late-arriving record included.  In other words, just because 1
minute has passed (= the 1-minute window is "done") does not mean that all
the data for that time interval has already been processed -- so sending
only a single update after 1 minute has passed would produce incorrect
results in many cases.  For this reason you currently see a downstream
update anytime there is a new incoming data record ("send it anytime it
does receive it from the input topic").  So the point here is to ensure
correctness of processing.
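To make the late-arrival scenario concrete, here is a small, self-contained Python sketch. This is not Kafka Streams code; `window_start`, the record tuples, and the `updates` list are made up purely for illustration, and it only simulates the update-per-record behavior described above:

```python
from collections import defaultdict

WINDOW_MS = 60_000  # 1-minute windows

def window_start(event_time_ms):
    """Map an event timestamp to the start of its 1-minute window."""
    return event_time_ms - (event_time_ms % WINDOW_MS)

counts = defaultdict(int)  # window start -> running count
updates = []               # every downstream update that would be emitted

# (event_time_ms, key) records; the last one arrives five minutes late but
# belongs to the first window according to its event-time.
records = [(5_000, "a"), (61_000, "a"), (310_000, "a"), (10_000, "a")]

for event_time, key in records:
    w = window_start(event_time)
    counts[w] += 1
    # An update is emitted downstream for every incoming record,
    # including late ones that re-open an "old" window.
    updates.append((w, counts[w]))

print(updates)
# The first window (start 0) is updated twice: once for the on-time
# record and once more when the late record arrives.
```

Note that the window for start time 0 appears twice in the output, with its count corrected from 1 to 2 by the late record; that second update is exactly what a single send-at-end-of-minute scheme would miss.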

That said, one known drawback of the current behavior is that users have
not been able to control (read: decrease/reduce) the rate/volume of the
resulting downstream updates.  For example, if you have an input topic with
a rate of 1 million msg/s (which is easy for Kafka), some users want to
aggregate/window results primarily to reduce the input rate to a lower
number (e.g. 1 thousand msg/s) so that the data can be fed from Kafka to
other systems that might not scale as well as Kafka.  To help these use
cases, the next major version of Kafka will have a new configuration
parameter that allows you to control the rate/volume of downstream
updates.  Here, the point is to help users optimize resource usage rather
than ensure correctness of processing.  This new parameter should also
help with your use case, but note that even this new parameter is not
based on strict time behavior or time windows.


Re: Kafka streams Issue

Posted by Guozhang Wang <wa...@gmail.com>.
Hi Hamza,

We are also working on letting users have some indirect control over the
data volume, based on caching:

https://cwiki.apache.org/confluence/display/KAFKA/KIP-63%3A+Unify+store+and+downstream+caching+in+streams
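For experimentation, the KIP proposes controlling coalescing via the record cache size, together with the existing commit interval. A hedged sketch of the relevant Streams settings (parameter names are taken from the KIP and may differ in the released version):

```
# From KIP-63 (names may change before release): total bytes of record
# caches across all stream threads; a larger cache means fewer, more
# coalesced downstream updates.
cache.max.bytes.buffering=10485760

# Caches are flushed downstream at least once per commit interval.
commit.interval.ms=30000
```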

Guozhang




-- 
-- Guozhang

RE: Kafka streams Issue

Posted by Hamza HACHANI <ha...@supcom.tn>.
Thanks, I will try that.


Hamza


RE: Kafka streams Issue

Posted by "Tauzell, Dave" <Da...@surescripts.com>.
Let's say you currently have:

Processing App ---> OUTPUT TOPIC ---> output consumer

You would ideally like the processing app to write to the output topic only once per minute, but you cannot easily do this.  So what you might be able to do is:

Processing App ---> INTERMEDIATE OUTPUT TOPIC ---> Coalesce Process ---> OUTPUT TOPIC

The Coalesce Process is an application that does something like:

Bucket = new list()
Consumer = createConsumer()
While (message = Consumer.next()) {
    Window = calculate current window
    If message is after Window:
        Send Bucket to OUTPUT TOPIC
        Bucket = new list()
    Add message to Bucket
}
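The bucketing logic above can be sketched as a self-contained Python function. This is a hypothetical stand-in for the Coalesce Process, driven by an in-memory list of messages; in a real deployment the input would come from the intermediate Kafka topic via a consumer, and each batch would be produced to the output topic:

```python
WINDOW_MS = 60_000  # flush the bucket once per 1-minute window

def coalesce(messages):
    """Group (timestamp_ms, value) messages into per-window batches."""
    batches = []
    bucket = []
    current_window = None
    for ts, value in messages:
        window = ts - (ts % WINDOW_MS)
        if current_window is None:
            current_window = window
        if window > current_window:   # message is after the current window
            batches.append(bucket)    # "Send Bucket to OUTPUT TOPIC"
            bucket = []
            current_window = window
        bucket.append(value)
    if bucket:
        batches.append(bucket)        # flush the trailing bucket
    return batches

print(coalesce([(1_000, "a"), (2_000, "b"), (61_000, "c"), (125_000, "d")]))
# → [['a', 'b'], ['c'], ['d']]
```

One design point worth noting: the message that closes a window is added to the *new* bucket after the old one is sent, so no message is dropped at a window boundary.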

Dave Tauzell | Senior Software Engineer | Surescripts
O: 651.855.3042 | www.surescripts.com |   Dave.Tauzell@surescripts.com
Connect with us: Twitter I LinkedIn I Facebook I YouTube


This e-mail and any files transmitted with it are confidential, may contain sensitive information, and are intended solely for the use of the individual or entity to whom they are addressed. If you have received this e-mail in error, please notify the sender by reply e-mail immediately and destroy all copies of the e-mail and any attachments.

RE: Kafka streams Issue

Posted by Hamza HACHANI <ha...@supcom.tn>.
Hi Dave,

Could you explain your idea a bit more?
I can't figure out what you are suggesting.
Thank you

-Hamza

RE: Kafka streams Issue

Posted by "Tauzell, Dave" <Da...@surescripts.com>.
You could send the message immediately to an intermediary topic.  Then have a consumer of that topic that pulls messages off and waits until the minute is up.

-Dave

Dave Tauzell | Senior Software Engineer | Surescripts
O: 651.855.3042 | www.surescripts.com |   Dave.Tauzell@surescripts.com
Connect with us: Twitter I LinkedIn I Facebook I YouTube

