Posted to issues@flink.apache.org by "Bhupendra Yadav (Jira)" <ji...@apache.org> on 2024/01/05 07:35:00 UTC

[jira] [Updated] (FLINK-32631) FlinkSessionJob stuck in Created/Reconciling state because of No Job found error in JobManager

     [ https://issues.apache.org/jira/browse/FLINK-32631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bhupendra Yadav updated FLINK-32631:
------------------------------------
    Description: 
{*}Background{*}: We use FlinkSessionJob resources to submit jobs to a session cluster, with Flink Kubernetes Operator 1.5.0.

{*}Bug{*}: Jobs frequently get stuck in the CREATED/RECONCILING state. The Flink operator logs show the error {_}Job could not be found{_}. Full trace [here|https://ideone.com/NuAyEK].
 # When a Flink session job is submitted, the Flink operator submits the job to the Flink cluster.
 # Suppose the job finishes (or reaches another terminal state) and the JobManager (JM) then restarts for some reason. The job no longer exists in the JM.
 # Upon reconciliation, the Flink operator queries the JM's REST API for the job by its jobID and receives a 404 error, indicating that the job was not found.
 # The operator logs the error, and the job stays stuck in that state indefinitely.
 # Attempting to restart or suspend the job through the operator's own mechanisms also fails, because the operator keeps calling the REST API and receiving the same 404 error.
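
The steps above can be modelled with a small toy simulation. This is only a sketch of the failure mode as described in this report, not the operator's actual code: `FakeJobManager`, `reconcile_once`, `JobNotFound` and the status strings are hypothetical names introduced for illustration, and the restart behaviour is an assumption taken from step 2.

```python
class JobNotFound(Exception):
    """Stands in for the 404 'Job could not be found' response from the JM."""


class FakeJobManager:
    """Hypothetical in-memory stand-in for a session cluster's JobManager."""

    def __init__(self):
        self.jobs = {}

    def submit(self, job_id):
        self.jobs[job_id] = "RUNNING"

    def finish(self, job_id):
        self.jobs[job_id] = "FINISHED"

    def restart(self):
        # Assumption from the report: after a restart the JM no longer
        # knows about jobs that had already reached a terminal state.
        self.jobs = {j: s for j, s in self.jobs.items() if s == "RUNNING"}

    def get_status(self, job_id):
        if job_id not in self.jobs:
            raise JobNotFound(job_id)
        return self.jobs[job_id]


def reconcile_once(jm, job_id, recorded_status):
    """Naive reconcile pass: a lookup error leaves the recorded status as-is."""
    try:
        return jm.get_status(job_id)
    except JobNotFound:
        # The error would only be logged here; the status is never updated.
        return recorded_status


jm = FakeJobManager()
jm.submit("job-1")   # step 1
jm.finish("job-1")   # step 2: job reaches a terminal state...
jm.restart()         # ...and the JM restarts

status = "RECONCILING"
for _ in range(3):   # steps 3-5: every pass hits the same "not found" error
    status = reconcile_once(jm, "job-1", status)
print(status)        # prints "RECONCILING"
```

However many reconcile passes run, the recorded status never leaves RECONCILING, which matches the stuck state reported here.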

{*}Expected Behavior{*}: When the Flink operator reconciles a job and finds that it no longer exists in the Flink cluster, it should handle the situation gracefully. Instead of getting stuck and logging errors indefinitely, the operator should mark the job as failed or deleted, or set another appropriate status for it.
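
A minimal sketch of the requested behaviour, assuming the reconcile step can tell a 404 apart from other responses. It does not reflect how the operator is implemented: `SessionJobStatus`, `reconcile`, the `lookup` callable and the choice of `FAILED` as the resulting state are all hypothetical and stand in for the REST query and status update described in this report.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

# Hypothetical lookup: returns (HTTP status code, job state or None).
JobLookup = Callable[[str], Tuple[int, Optional[str]]]


@dataclass
class SessionJobStatus:
    job_id: str
    state: str = "RECONCILING"
    error: Optional[str] = None


def reconcile(status: SessionJobStatus, lookup: JobLookup) -> SessionJobStatus:
    """Update the recorded status from one JobManager lookup."""
    code, jm_state = lookup(status.job_id)
    if code == 200 and jm_state is not None:
        # Job is known to the JM: mirror its state.
        status.state = jm_state
        status.error = None
    elif code == 404:
        # Job is gone: move to a terminal state instead of retrying forever.
        status.state = "FAILED"
        status.error = f"Job {status.job_id} not found in JobManager"
    else:
        # Other errors may be transient: keep the state, record the error.
        status.error = f"Unexpected HTTP {code} from JobManager"
    return status


def missing_job(job_id: str) -> Tuple[int, Optional[str]]:
    """Simulates a JM that has no record of the job."""
    return 404, None


result = reconcile(SessionJobStatus(job_id="job-1"), missing_job)
print(result.state, "-", result.error)
# prints "FAILED - Job job-1 not found in JobManager"
```

Treating only the 404 case as terminal keeps transient failures (for example a JM that is still starting) retryable, while a job that is truly gone stops blocking restart and suspend requests.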

  was:
{*}Background{*}: We are using FlinkSessionJob for submitting jobs to a session cluster and flink kubernetes operator 1.5.0.

{*}Bug{*}: We frequently encounter a problem where the job gets stuck in CREATED/RECONCILING state. On checking flink operator logs we see the error {_}Job could not be found{_}. Full trace [here|https://ideone.com/NuAyEK].
 # When a Flink session job is submitted, the Flink operator submits the job to the Flink Cluster.
 # If the Flink job manager (JM) restarts for some reason, the job may no longer exist in the JM.
 # Upon reconciliation, the Flink operator queries the JM's REST API for the job using its jobID, but it receives a 404 error, indicating that the job is not found.
 # The operator then encounters an error and logs it, leading to the job getting stuck in an indefinite state.
 # Attempting to restart or suspend the job using the operator's provided mechanisms also fails because the operator keeps calling the REST API and receiving the same 404 error.

{*}Expected Behavior{*}: Ideally, when the Flink operator reconciles a job and finds that it no longer exists in the Flink Cluster, it should handle the situation gracefully. Instead of getting stuck and logging errors indefinitely, the operator should mark the job as failed or deleted, or set an appropriate status for it.


> FlinkSessionJob stuck in Created/Reconciling state because of No Job found error in JobManager
> ----------------------------------------------------------------------------------------------
>
>                 Key: FLINK-32631
>                 URL: https://issues.apache.org/jira/browse/FLINK-32631
>             Project: Flink
>          Issue Type: Bug
>          Components: Kubernetes Operator
>    Affects Versions: 1.16.0
>         Environment: Local
>            Reporter: Bhupendra Yadav
>            Priority: Major


