Posted to user@spark.apache.org by Biplob Biswas <re...@gmail.com> on 2017/07/24 13:30:23 UTC

Conflict resolution for data in Spark Streaming

Hi,

I have a situation where updates arrive from 2 different data sources, and
at times this data lands in the same batch, as defined by the streaming
context's batch duration of 500 ms (the value recommended in the Spark
documentation).
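
For concreteness, here is a minimal sketch of the setup I mean. The host
names, ports, line format and the Event case class are all placeholders
standing in for the real pipeline, not the actual code:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Milliseconds, StreamingContext}

// Hypothetical event type; eventTime is the event's own timestamp
case class Event(key: String, value: String, eventTime: Long)

object StreamingSetupSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("conflict-resolution-example")
    // 500 ms batch interval, as mentioned above
    val ssc = new StreamingContext(conf, Milliseconds(500))

    // Placeholder line format: "key,value,eventTime"
    def parse(line: String): Event = {
      val Array(k, v, t) = line.split(",")
      Event(k, v, t.toLong)
    }

    // Two independent sources standing in for the real ones;
    // hosts and ports are placeholders
    val sourceA = ssc.socketTextStream("source-a-host", 9001).map(parse)
    val sourceB = ssc.socketTextStream("source-b-host", 9002).map(parse)

    // Updates from both sources can end up in the same 500 ms batch
    val merged = sourceA.union(sourceB).map(e => (e.key, e))
    merged.print()

    ssc.start()
    ssc.awaitTermination()
  }
}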

That by itself is not the problem. The problem is that when the data is
partitioned across different executors, it is not processed in the order
in which it originally arrived. I know this because the event that arrives
last should be the one used for the updated state, yet that is not always
the event that ends up there. This race condition exists and is not
consistently reproducible.
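
To make the intended semantics concrete, the sketch below shows what I mean
by "the last event should win": resolving per key by the event's own
timestamp rather than by processing order. It reuses the hypothetical Event
type from the sketch above and is only an illustration of the requirement,
not something I have validated:

import org.apache.spark.streaming.dstream.DStream

// Keep, per key, the event with the greatest eventTime: first within the
// batch, then against the running state, so the result does not depend on
// the order in which executors happen to process the records.
// Note: updateStateByKey needs ssc.checkpoint(<dir>) to be set.
def latestPerKey(merged: DStream[(String, Event)]): DStream[(String, Event)] = {
  def newer(a: Event, b: Event): Event = if (a.eventTime >= b.eventTime) a else b

  val newestInBatch = merged.reduceByKey(newer _)

  newestInBatch.updateStateByKey[Event] { (newEvents: Seq[Event], state: Option[Event]) =>
    (newEvents ++ state).reduceOption(newer)
  }
}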

Does anyone have an idea how to fix this? I am not sure whether anyone else
has faced this kind of issue and, if so, how they resolved it.

Thanks & Regards
Biplob Biswas