Posted to user@flink.apache.org by Harshvardhan Agrawal <ha...@gmail.com> on 2018/07/27 16:11:34 UTC

Order of events in a Keyed Stream

Hi,


We are currently using Flink to process financial data. We are getting
position data from Kafka and we enrich the positions with account and
product information. We are using Ingestion time while processing events.
The question I have is: say I key the position datastream by account number.
If I have two consecutive Kafka messages with the same account and product
info where the second one is an updated position of the first one, does
Flink guarantee that the messages will be processed on the same slot in the
same worker? We want to ensure that we don’t process them out of order.

Thank you!
-- 
Regards,
Harshvardhan

Re: Order of events in a Keyed Stream

Posted by Harshvardhan Agrawal <ha...@gmail.com>.
Thanks for the response guys.

Based on Niels' response, it seems like a keyBy immediately after reading
from the source should map all messages with the same account number to the
same slot.
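That routing can be pictured with a simplified model of Flink's key-to-subtask assignment. This is only an illustrative sketch: real Flink murmur-hashes the key's hashCode into a key group and then maps key groups to operator subtasks, and the class and account names here are invented for the example.

```java
public class KeyRouting {
    // Simplified stand-in for Flink's key -> subtask assignment.
    // Real Flink: murmurHash(key.hashCode()) -> key group -> subtask index.
    static int subtaskFor(String accountNumber, int parallelism) {
        return Math.floorMod(accountNumber.hashCode(), parallelism);
    }

    public static void main(String[] args) {
        int parallelism = 4;
        // Two consecutive updates keyed by the same account number...
        int first  = subtaskFor("ACC-42", parallelism);
        int second = subtaskFor("ACC-42", parallelism);
        // ...are always routed to the same subtask (and hence the same slot),
        // which is what makes in-order processing per key possible.
        System.out.println(first == second); // true
    }
}
```

The point is only that the mapping is deterministic per key: as long as the stream stays keyed, both updates for one account flow through one subtask.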

-- 
Regards,
Harshvardhan

Re: Order of events in a Keyed Stream

Posted by Renjie Liu <li...@gmail.com>.
Hi,
Another way to ensure order is to add a logical version number to each
message so that an earlier version will not override a later one. Relying on
timestamps depends on your NTP servers working correctly.
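Renjie's version-number idea can be sketched as a per-key guard that drops stale updates. In a real Flink job this state would live in keyed state (e.g. a ValueState inside a KeyedProcessFunction); the plain map and all names here are assumptions for illustration only.

```java
import java.util.HashMap;
import java.util.Map;

public class VersionGuard {
    // Last accepted version per account. In Flink this would be
    // keyed ValueState; a plain map stands in for it here.
    private final Map<String, Long> lastVersion = new HashMap<>();

    /** Accept an update only if its version is newer than what we stored. */
    boolean accept(String account, long version) {
        Long prev = lastVersion.get(account);
        if (prev != null && version <= prev) {
            return false; // stale or duplicate update: drop it
        }
        lastVersion.put(account, version);
        return true;
    }

    public static void main(String[] args) {
        VersionGuard guard = new VersionGuard();
        System.out.println(guard.accept("ACC-42", 1)); // true
        System.out.println(guard.accept("ACC-42", 3)); // true
        System.out.println(guard.accept("ACC-42", 2)); // false (arrived out of order)
    }
}
```

With such a guard, even if an older position update slips in late, it cannot overwrite a newer one.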

-- 
Liu, Renjie
Software Engineer, MVAD

Re: Order of events in a Keyed Stream

Posted by Niels Basjes <Ni...@basjes.nl>.
Hi,

The basic point is that you will only get the messages in a guaranteed
order if that order is maintained in every step from creation to use.
In Kafka, order is only guaranteed for messages in the same partition.
So if you need them in order per account, the producing system must use
the account id as the message key, which forces a given account into a
specific Kafka partition.
The Flink Kafka source will then read them sequentially in the right
order, but in order to KEEP them in that order you should do a keyBy
immediately after reading and use only keyed streams from that point
onwards.
As soon as you shuffle or key by a different key, the ordering within an
account is no longer guaranteed.
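On the producer side, "use the account id as the key" means sending keyed records; Kafka's partitioner then maps equal keys to the same partition, where appends preserve send order. A simplified model of that (the real default partitioner applies murmur2 to the serialized key bytes, not String.hashCode; class and account names are illustrative):

```java
import java.util.ArrayList;
import java.util.List;

public class PartitionOrder {
    // Simplified stand-in for Kafka's default partitioner; the real one
    // applies murmur2 to the serialized key bytes, not String.hashCode().
    static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    public static void main(String[] args) {
        int numPartitions = 12;
        List<List<String>> partitions = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) partitions.add(new ArrayList<>());

        // The producer keys every record by account id, so both position
        // updates for ACC-42 are appended to one partition, in send order.
        partitions.get(partitionFor("ACC-42", numPartitions)).add("position v1");
        partitions.get(partitionFor("ACC-42", numPartitions)).add("position v2");

        System.out.println(partitions.get(partitionFor("ACC-42", numPartitions)));
        // [position v1, position v2]
    }
}
```

Because the key, not the producer's round-robin default, decides the partition, v1 can never end up behind v2 inside Kafka.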

In general I always put a very accurate timestamp in all of my events
(epoch milliseconds, in some cases even epoch microseconds) so I can always
check whether an ordering problem occurred.

Niels Basjes


-- 
Best regards / Met vriendelijke groeten,

Niels Basjes

Re: Order of events in a Keyed Stream

Posted by Congxian Qiu <qc...@gmail.com>.
Hi,
The messages with the same key should probably also be in the *same
partition* of the Kafka topic.

-- 
Blog:http://www.klion26.com
GTalk:qcx978132955
一切随心

Re: Order of events in a Keyed Stream

Posted by Hequn Cheng <ch...@gmail.com>.
Hi Harshvardhan,
If 1. the messages exist on the same topic, 2. there is no rebalance, and
3. you keyBy on the same field with the same value, then the answer is yes.

Best, Hequn

On Sun, Jul 29, 2018 at 3:56 AM, Harshvardhan Agrawal <
harshvardhan.agr93@gmail.com> wrote:

> Hey,
>
> The messages will exist on the same topic. I intend to keyby on the same
> field. The question is that will the two messages be mapped to the same
> task manager and on the same slot. Also will they be processed in correct
> order given they have the same keys?
>
> On Fri, Jul 27, 2018 at 21:28 Hequn Cheng <ch...@gmail.com> wrote:
>
>> Hi Harshvardhan,
>>
>> There are a number of factors to consider.
>> 1. the consecutive Kafka messages must exist in a same topic of kafka.
>> 2. the data should not been rebalanced. For example, operators should be
>> chained in order to avoid rebalancing.
>> 3. if you perform keyBy(), you should keyBy on a field the consecutive
>> two messages share the same value.
>>
>> Best, Hequn
>>
>> On Sat, Jul 28, 2018 at 12:11 AM, Harshvardhan Agrawal <
>> harshvardhan.agr93@gmail.com> wrote:
>>
>>> Hi,
>>>
>>>
>>> We are currently using Flink to process financial data. We are getting
>>> position data from Kafka and we enrich the positions with account and
>>> product information. We are using Ingestion time while processing events.
>>> The question I have is: say I key the position datasream by account number.
>>> If I have two consecutive Kafka messages with the same account and product
>>> info where the second one is an updated position of the first one, does
>>> Flink guarantee that the messages will be processed on the same slot in the
>>> same worker? We want to ensure that we don’t process them out of order.
>>>
>>> Thank you!
>>> --
>>> Regards,
>>> Harshvardhan
>>>
>>
>> --
> Regards,
> Harshvardhan
>

Re: Order of events in a Keyed Stream

Posted by Harshvardhan Agrawal <ha...@gmail.com>.
Hey,

The messages will exist on the same topic. I intend to keyBy on the same
field. The question is: will the two messages be mapped to the same task
manager and the same slot? And will they be processed in the correct order,
given that they have the same keys?

--
Regards,
Harshvardhan

Re: Order of events in a Keyed Stream

Posted by Hequn Cheng <ch...@gmail.com>.
Hi Harshvardhan,

There are a number of factors to consider.
1. The consecutive Kafka messages must be in the same Kafka topic.
2. The data should not be rebalanced. For example, operators should be
chained in order to avoid rebalancing.
3. If you perform keyBy(), you should keyBy on a field for which the two
consecutive messages share the same value.
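The second factor matters because a rebalance distributes records round robin, splitting one key's updates across channels, while keyed routing keeps them together. A toy, single-threaded simulation of the two routing modes (plain Java, not Flink's actual network stack; all names are illustrative):

```java
import java.util.ArrayList;
import java.util.List;

public class RebalanceVsKeyBy {
    /** Route "key:payload" events into channels, either keyed or round robin. */
    static List<List<String>> route(String[] events, boolean keyed, int parallelism) {
        List<List<String>> channels = new ArrayList<>();
        for (int i = 0; i < parallelism; i++) channels.add(new ArrayList<>());
        for (int i = 0; i < events.length; i++) {
            String key = events[i].split(":")[0];
            int channel = keyed
                    ? Math.floorMod(key.hashCode(), parallelism) // same key -> same channel
                    : i % parallelism;                           // round robin (rebalance)
            channels.get(channel).add(events[i]);
        }
        return channels;
    }

    public static void main(String[] args) {
        String[] events = {"ACC-42:v1", "ACC-42:v2", "ACC-42:v3"};
        // keyBy-style routing keeps v1, v2, v3 on one channel, in order...
        System.out.println(route(events, true, 2));
        // ...while rebalance-style routing scatters them across channels,
        // so downstream their relative order can no longer be relied on.
        System.out.println(route(events, false, 2));
    }
}
```

Once the three updates sit on different channels, nothing downstream guarantees which one is processed first, which is exactly the out-of-order risk the thread is about.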

Best, Hequn
