Posted to user@flink.apache.org by cokle <so...@gmail.com> on 2020/09/20 11:13:20 UTC

Fault tolerance: StickyAllocationAndLocalRecoveryTestJob

Hello members,
I am new to the Apache Flink world, and over the last month I have been
exploring the testing scenarios offered by the Flink team, as well as
different books, to learn Flink.

Today I was trying to better understand this test, which you can find here:

http://github.com/apache/flink/blob/master/flink-end-to-end-tests/flink-local-recovery-and-allocation-test/src/main/java/org/apache/flink/streaming/tests/StickyAllocationAndLocalRecoveryTestJob.java

I will try to explain how I understand it, and maybe you can point out the
problems in my reasoning.

Testing parameters:
       delay = 1L;
       maxAttempts = 3;
       stateBackend = FsStateBackend
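
To show how I wired these up, here is a minimal sketch of my own setup (the
class name, checkpoint path, and checkpoint interval are my placeholders, not
values taken from the test job itself):

import org.apache.flink.runtime.state.filesystem.FsStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointSetupExample {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // FsStateBackend: working state lives on the TaskManager heap,
        // checkpoints are written to a file system path (placeholder path here).
        env.setStateBackend(new FsStateBackend("file:///tmp/flink-checkpoints"));

        // Take a checkpoint every second so there is always something recent to restore.
        env.enableCheckpointing(1000L);
    }
}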


RandomLongSource:
First, we create a data source implementing the CheckpointedFunction
interface. If the number of attempts is higher than the maximum allowed
attempts, we emit a last event and shut down the source; otherwise, we keep
emitting events.
1.1 Why do we need maxAttempts in this scenario? Is it the number of times
we allow the application to fail?
The initializeState method is called every time the user-defined function is
initialized, or when the function is actually recovering from an earlier
checkpoint. [1]
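
As a sanity check of my understanding, this is roughly how I picture that
source. It is only my own simplified sketch (the class and field names are
made up, the real class is in the file linked above):

import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.runtime.state.FunctionInitializationContext;
import org.apache.flink.runtime.state.FunctionSnapshotContext;
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;
import org.apache.flink.streaming.api.functions.source.RichParallelSourceFunction;

// Simplified sketch, not the real RandomLongSource from the test.
public class SimplifiedRandomLongSource extends RichParallelSourceFunction<Long>
        implements CheckpointedFunction {

    private final int maxAttempts;
    private final long delayMillis;

    private volatile boolean running = true;
    private long currentValue;
    private transient ListState<Long> valueState;

    public SimplifiedRandomLongSource(int maxAttempts, long delayMillis) {
        this.maxAttempts = maxAttempts;
        this.delayMillis = delayMillis;
    }

    @Override
    public void run(SourceContext<Long> ctx) throws Exception {
        // If the job has already been restarted more times than maxAttempts,
        // emit one final event and let the source (and therefore the job) finish.
        if (getRuntimeContext().getAttemptNumber() > maxAttempts) {
            synchronized (ctx.getCheckpointLock()) {
                ctx.collect(Long.MAX_VALUE);
            }
            return;
        }

        while (running) {
            synchronized (ctx.getCheckpointLock()) {
                ctx.collect(currentValue++);
            }
            Thread.sleep(delayMillis);
        }
    }

    @Override
    public void cancel() {
        running = false;
    }

    @Override
    public void snapshotState(FunctionSnapshotContext context) throws Exception {
        // Store the current value so a restore can continue where we left off.
        valueState.clear();
        valueState.add(currentValue);
    }

    @Override
    public void initializeState(FunctionInitializationContext context) throws Exception {
        // Called both on a fresh start and on recovery from a checkpoint.
        valueState = context.getOperatorStateStore()
                .getListState(new ListStateDescriptor<>("value", Long.class));
        if (context.isRestored()) {
            for (Long v : valueState.get()) {
                currentValue = v;
            }
        }
    }
}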



StateCreatingFlatMap:
After implementing the source, we use the flat map operator to generate
failure scenarios and test how Flink handles them. We kill TaskManagers with
the halt method if the process's PID corresponds to the PID we decided to
kill.
In the initializeState method, we handle how recovery is done, and if the
state was previously restored we capture the information about it.
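
Again just to check my understanding, here is a stripped-down sketch of how I
imagine that flat map. The names (SimplifiedStateCreatingFlatMap, pidToKill,
someState) are my own placeholders, not the real fields from the test:

import java.lang.management.ManagementFactory;

import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.runtime.state.FunctionInitializationContext;
import org.apache.flink.runtime.state.FunctionSnapshotContext;
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;
import org.apache.flink.util.Collector;

// Simplified sketch of my mental model, not the real StateCreatingFlatMap.
public class SimplifiedStateCreatingFlatMap extends RichFlatMapFunction<Long, String>
        implements CheckpointedFunction {

    private final long pidToKill;

    private transient ListState<String> someState;
    private boolean wasRestored;

    public SimplifiedStateCreatingFlatMap(long pidToKill) {
        this.pidToKill = pidToKill;
    }

    @Override
    public void flatMap(Long value, Collector<String> out) {
        // Crude way to get this JVM's PID ("pid@host") that works on Java 8.
        long myPid = Long.parseLong(
                ManagementFactory.getRuntimeMXBean().getName().split("@")[0]);

        // If this TaskManager is the chosen victim, kill the JVM hard.
        // halt() skips shutdown hooks, so to Flink it looks like a real crash.
        if (myPid == pidToKill) {
            Runtime.getRuntime().halt(-1);
        }

        out.collect("value=" + value + ", restored=" + wasRestored);
    }

    @Override
    public void snapshotState(FunctionSnapshotContext context) throws Exception {
        someState.clear();
        someState.add("checkpoint-" + context.getCheckpointId());
    }

    @Override
    public void initializeState(FunctionInitializationContext context) throws Exception {
        someState = context.getOperatorStateStore()
                .getListState(new ListStateDescriptor<>("some-state", String.class));

        // isRestored() tells us whether we came up from a checkpoint (recovery)
        // or from a fresh start; the test records this kind of information.
        wasRestored = context.isRestored();
    }
}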


This is my understanding of the test source code, but it is not clear to me
how it will really work, or whether I am capturing the scenario it
demonstrates correctly.
I decided to test it using 1 JobManager and 3 TaskManagers (even though the
maximum operator parallelism is 1).

The application will start running and will be checkpointed constantly. At
some point a task will be killed and the application will be restored from
the last saved checkpoint. If the application has 4 failures (more than the
3 allowed attempts), then we will successfully finish the application. Is
that correct?
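
Part of my confusion is whether "allowed attempts" means a restart strategy
like the one below. This is only how I would bound restarts in general in
Flink; I am not sure it is what the test job actually does with maxAttempts:

import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RestartStrategyExample {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Allow at most 3 restarts, 1 ms apart; a 4th failure would make the job fail for good.
        env.setRestartStrategy(
                RestartStrategies.fixedDelayRestart(3, Time.milliseconds(1L)));
    }
}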


2.1 Is this how the logic of the scenario works?
2.2 Is this an example of fault tolerance using checkpoints?

I will upload screenshots of the UI dashboard and an exception that I don't
really understand, but on some forums I read that it was a problem with the
JobManager heap size.


I apologize if my question is not well formatted or if it sounds stupid.
Best regards

fail-1.png
<http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/file/t2744/fail-1.png>  
fail-2.png
<http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/file/t2744/fail-2.png>  
fail-3.png
<http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/file/t2744/fail-3.png>  


[1] http://ci.apache.org/projects/flink/flink-docs-stable/dev/stream/state/state.html#using-operator-state


