Posted to user@flink.apache.org by burgesschen <tc...@bloomberg.net> on 2017/07/12 18:22:35 UTC

sanity check in production

Hello everyone, 

Our team has found it difficult to test a new deployment of a Flink job,
as explained below.


Goal: 
When we deploy a new version of a Flink job in production, we want the job
to process some test messages and then verify the output, to make sure that
the job is running correctly (a sanity check).

Problem: 
The test messages interfere with the watermark of the Flink job,
potentially causing it to drop real messages.

Possible solutions: 
1. Have a separate watermark for the test messages
  (does not appear to be supported by the current framework)

2. Run a separate Flink job (same code) in production for the sanity check
before the actual deployment
  (high operational costs)

3. Cancel the running production job with a savepoint, start a new job from
that savepoint, run the sanity check (which messes up the watermark of the
new job), kill the new job, then do the actual deployment from the same
savepoint.
  (high operational costs) 
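
A variant of option 1 that is sometimes considered is to keep a single
watermark but let only real messages advance it. The following is a minimal
plain-Java sketch of that bookkeeping, not Flink code: the class name, the
isTestMessage flag and the bounded out-of-orderness value are all
hypothetical. In a real job this logic would presumably sit inside a custom
timestamp/watermark assigner. Note that test messages handled this way can
still be treated as late themselves if their timestamps lag behind the
live stream.

```java
/** Hypothetical helper: tracks a watermark while ignoring flagged test messages. */
public class TestAwareWatermarkTracker {
    private final long maxOutOfOrdernessMs;
    private long maxRealTimestampMs = Long.MIN_VALUE;

    public TestAwareWatermarkTracker(long maxOutOfOrdernessMs) {
        this.maxOutOfOrdernessMs = maxOutOfOrdernessMs;
    }

    /** Records one message; only real (non-test) messages may advance the watermark. */
    public void onMessage(long timestampMs, boolean isTestMessage) {
        if (!isTestMessage && timestampMs > maxRealTimestampMs) {
            maxRealTimestampMs = timestampMs;
        }
    }

    /** Current watermark, or Long.MIN_VALUE if no real message has been seen yet. */
    public long currentWatermark() {
        if (maxRealTimestampMs == Long.MIN_VALUE) {
            return Long.MIN_VALUE;
        }
        return maxRealTimestampMs - maxOutOfOrdernessMs;
    }

    public static void main(String[] args) {
        TestAwareWatermarkTracker tracker = new TestAwareWatermarkTracker(1_000L);
        tracker.onMessage(10_000L, false); // real message
        tracker.onMessage(99_000L, true);  // test message far in the future: ignored
        tracker.onMessage(10_500L, false); // real message
        System.out.println(tracker.currentWatermark()); // prints 9500
    }
}
```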

Any ideas are appreciated, thanks!



--
View this message in context: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/sanity-check-in-production-tp14229.html
Sent from the Apache Flink User Mailing List archive at Nabble.com.

Re: sanity check in production

Posted by Gyula Fóra <gy...@gmail.com>.
Hi!

Assuming you have some spare compute resources on your cluster (which you
should have in a production setting to be safe), I think 2) would be the
best option, ideally started from a savepoint of the production job so
that you verify your state logic as well.

You could also run the test job with a lower parallelism setting and
verify that it actually works, then maybe run some live data through it as
well before killing off the test job and updating the prod job.
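
One way to do the "verify that it actually works" step is to compare what
the test job and the production job emit for the same slice of live data.
The sketch below is plain Java and assumes, purely for illustration, that
both outputs have already been collected into per-key counts; the class
and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical helper for comparing the output of two runs of the same job. */
public class SanityCheck {

    /** Returns the sorted keys whose values differ, including keys present in only one map. */
    public static Set<String> mismatchedKeys(Map<String, Long> production,
                                             Map<String, Long> candidate) {
        Set<String> allKeys = new TreeSet<>(production.keySet());
        allKeys.addAll(candidate.keySet());

        Set<String> mismatches = new TreeSet<>();
        for (String key : allKeys) {
            if (!Objects.equals(production.get(key), candidate.get(key))) {
                mismatches.add(key);
            }
        }
        return mismatches;
    }

    public static void main(String[] args) {
        Map<String, Long> production = new HashMap<>();
        production.put("a", 3L);
        production.put("b", 5L);

        Map<String, Long> candidate = new HashMap<>();
        candidate.put("a", 3L);
        candidate.put("b", 4L); // differs from production
        candidate.put("c", 1L); // missing in production

        System.out.println(mismatchedKeys(production, candidate)); // prints [b, c]
    }
}
```

An empty result would suggest the new version agrees with production on
that sample; it does not prove the two jobs are equivalent.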

Even though this might have a fairly high temporary cost, I think it is
ultimately worth it to test on live data before upgrading the production
job.

Cheers,
Gyula
