You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@streampark.apache.org by "xujiangfeng001 (via GitHub)" <gi...@apache.org> on 2023/06/20 16:04:01 UTC

[GitHub] [incubator-streampark] xujiangfeng001 opened a new pull request, #2809: [Feature][WIP][Service] flink cluster failure alarm&failover

xujiangfeng001 opened a new pull request, #2809:
URL: https://github.com/apache/incubator-streampark/pull/2809

   <!--
   Thank you for contributing to StreamPark! Please make sure that your code changes
   are covered with tests. And in case of new features or big changes
   remember to adjust the documentation.
   
   ## Contribution Checklist
   
     - If this is your first time, please read our contributor guidelines: [Submit Code](https://streampark.apache.org/community/submit_guide/submit_code).
   
     - Make sure that the pull request corresponds to a [GITHUB issue](https://github.com/apache/incubator-streampark/issues).
   
     - Name the pull request in the form "[Feature] Title of the pull request", where *Feature* can be replaced by `Hotfix`, `Bug`, etc.
   
     - Fill out the template below to describe the changes contributed by the pull request. That will give reviewers the context they need to do the review.
   
     - If the PR is unfinished, add `[WIP]` in your PR title, e.g., `[WIP][Feature] Title of the pull request`.
   
   -->
   
   ## What changes were proposed in this pull request
   
   Issue Number: close https://github.com/apache/incubator-streampark/issues/2423
   
   <!--(For example: This pull request proposed to add checkstyle plugin).-->
   
   ## Brief change log
   
   For details, please refer to: https://github.com/apache/incubator-streampark/issues/2423#issuecomment-1537068906.
   
   This PR only completes the server part, and the client configuration `alert` will be completed later.
   
   ## Verifying this change
   
   <!--*(Please pick either of the following options)*-->
   
   This change is a trivial rework / code cleanup without any test coverage.
   
   <!--*(example:)*
   - *Added integration tests for end-to-end.*
   - *Added *Test to verify the change.*
   - *Manually verified the change by testing locally.* -->
   
   ## Does this pull request potentially affect one of the following parts
    - Dependencies (does it add or upgrade a dependency): (no)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@streampark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [incubator-streampark] xujiangfeng001 commented on pull request #2809: [ISSUE-2423][Feature][WIP] flink cluster failure alarm&failover

Posted by "xujiangfeng001 (via GitHub)" <gi...@apache.org>.
xujiangfeng001 commented on PR #2809:
URL: https://github.com/apache/incubator-streampark/pull/2809#issuecomment-1612382077

   > Thx for the comments~
   > 
   > @xujiangfeng001 Would you mind changing the head title `[Feature][WIP][Service] flink cluster failure alarm&failover` like `[Issue-xxx][Feature] xxx` ?
   > 
   > I notice here's conflict when rebasing the `dev` branch, would you like to resolve it before the next review ? Thank you~
   > 
   > Hi, @MonsterChenzhuo Could you help to check the alarm logic on `k8s` mode ? thx so much.
   
   Hi @RocMarshal , thank you for your reply. I will finish modifying the code and handling conflicts before the next review.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@streampark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [incubator-streampark] RocMarshal commented on a diff in pull request #2809: [Feature][WIP][Service] flink cluster failure alarm&failover

Posted by "RocMarshal (via GitHub)" <gi...@apache.org>.
RocMarshal commented on code in PR #2809:
URL: https://github.com/apache/incubator-streampark/pull/2809#discussion_r1242310171


##########
streampark-console/streampark-console-service/src/main/assembly/script/upgrade/pgsql/2.2.0.sql:
##########
@@ -57,7 +57,10 @@ alter table "public"."t_flink_sql"
     add column "team_resource" varchar(64) default null;
 
 alter table "public"."t_flink_cluster"
-    add column "job_manager_url" varchar(150) collate "pg_catalog"."default";
+    add column "job_manager_url" varchar(150) collate "pg_catalog"."default",
+    add column "start_time" timestamp(6) collate "pg_catalog"."default",
+    add column "end_time" timestamp(6) collate "pg_catalog"."default",
+    add column "alert_id" int8 collate "pg_catalog"."default";

Review Comment:
   same as the mentioned above.



##########
streampark-console/streampark-console-service/src/main/assembly/script/upgrade/mysql/2.2.0.sql:
##########
@@ -44,7 +44,10 @@ alter table `t_flink_sql`
     add column `team_resource` varchar(64) default null;
 
 alter table `t_flink_cluster`
-    add column `job_manager_url` varchar(150) default null comment 'url address of jobmanager' after `address`;
+    add column `job_manager_url` varchar(150) default null comment 'url address of jobmanager' after `address`,
+    add column `start_time` datetime default null comment 'start time',
+    add column `end_time` datetime default null comment 'end time',
+    add column `alert_id` bigint default null comment 'alert id';

Review Comment:
   Do we need to split it into two statements based on PR granularity here?
   CC @wolfboys 



##########
streampark-console/streampark-console-service/src/main/resources/mapper/core/ApplicationMapper.xml:
##########
@@ -133,6 +133,14 @@
              limit 1
     </select>
 
+    <select id="getJobByClusterId" resultType="java.lang.Integer" parameterType="java.lang.Long">
+        SELECT
+            count(1)
+        FROM t_flink_app
+        WHERE flink_cluster_id = #{clusterId}
+            limit 1
+    </select>
+

Review Comment:
   - Do we need to filter the status of the job here ?  
   
   I sorted out briefly the background and logic.
   Assuming there is a job with a status of cancelled but current cluster information, will this job be counted if the cluster is abnormal?
   
   - We'd better to get a better select-id name here.
   
   
   Please correct me if I'm wrong.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@streampark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [incubator-streampark] xujiangfeng001 commented on pull request #2809: [Feature][WIP][Service] flink cluster failure alarm&failover

Posted by "xujiangfeng001 (via GitHub)" <gi...@apache.org>.
xujiangfeng001 commented on PR #2809:
URL: https://github.com/apache/incubator-streampark/pull/2809#issuecomment-1609820961

   Hi @RocMarshal, Thank you very much for your review and look forward to your reply. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@streampark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [incubator-streampark] xujiangfeng001 commented on a diff in pull request #2809: [ISSUE-2423][Feature][WIP] flink cluster failure alarm&failover

Posted by "xujiangfeng001 (via GitHub)" <gi...@apache.org>.
xujiangfeng001 commented on code in PR #2809:
URL: https://github.com/apache/incubator-streampark/pull/2809#discussion_r1248721141


##########
streampark-console/streampark-console-service/src/main/resources/mapper/core/ApplicationMapper.xml:
##########
@@ -133,6 +133,14 @@
              limit 1
     </select>
 
+    <select id="getJobByClusterId" resultType="java.lang.Integer" parameterType="java.lang.Long">
+        SELECT
+            count(1)
+        FROM t_flink_app
+        WHERE flink_cluster_id = #{clusterId}
+            limit 1
+    </select>
+

Review Comment:
   > * Do we need to filter the status of the job here ?
   > 
   > I sorted out briefly the background and logic. Assuming there is a job with a status of cancelled but current cluster information, will this job be counted if the cluster is abnormal?
   > 
   > * We'd better to get a better select-id name here.
   > 
   > Please correct me if I'm wrong.
   
   Hi @RocMarshal @wolfboys , Regarding this issue, I found in the implementation that due to the asynchronous monitoring of the `application` and `flink cluster` threads, it is not possible to directly determine whether it is an affected job based on the state in the application. I may be looking for a more suitable implementation method. I plan to maintain the original logic regarding this PR. If it is a job deployed in the `flink cluster`, it will be defined as an affected job regardless of its status.
   
   I look forward to your suggestions and responses very much.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@streampark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [incubator-streampark] xujiangfeng001 commented on pull request #2809: [Feature][WIP][Service] flink cluster failure alarm&failover

Posted by "xujiangfeng001 (via GitHub)" <gi...@apache.org>.
xujiangfeng001 commented on PR #2809:
URL: https://github.com/apache/incubator-streampark/pull/2809#issuecomment-1599933332

   Hi @wolfboys @RocMarshal, PTAL. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@streampark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [incubator-streampark] xujiangfeng001 commented on a diff in pull request #2809: [Feature][WIP][Service] flink cluster failure alarm&failover

Posted by "xujiangfeng001 (via GitHub)" <gi...@apache.org>.
xujiangfeng001 commented on code in PR #2809:
URL: https://github.com/apache/incubator-streampark/pull/2809#discussion_r1243990906


##########
streampark-console/streampark-console-service/src/main/resources/mapper/core/ApplicationMapper.xml:
##########
@@ -133,6 +133,14 @@
              limit 1
     </select>
 
+    <select id="getJobByClusterId" resultType="java.lang.Integer" parameterType="java.lang.Long">
+        SELECT
+            count(1)
+        FROM t_flink_app
+        WHERE flink_cluster_id = #{clusterId}
+            limit 1
+    </select>
+

Review Comment:
   I have carefully considered here and it is indeed necessary to filter the status.
   
   I want to filter the job status that is not `add` or `cancelled`. I need to explain why it is necessary to filter out tasks that are not `add` or `cancelled`:
   
   Because during the execution of this SQL statement, jobs in other states can be considered running or preparing to run in the `flink cluster`, but it may be due to the issue of two scheduling threads being out of sync,unable to update the job status in a timely manner, it may not be possible to determine the affected jobs based on a certain status.
   
   What do you think of `getAffectedJobsByClusterId` regarding select id.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@streampark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [incubator-streampark] xujiangfeng001 commented on a diff in pull request #2809: [Feature][WIP][Service] flink cluster failure alarm&failover

Posted by "xujiangfeng001 (via GitHub)" <gi...@apache.org>.
xujiangfeng001 commented on code in PR #2809:
URL: https://github.com/apache/incubator-streampark/pull/2809#discussion_r1243998046


##########
streampark-console/streampark-console-service/src/main/assembly/script/upgrade/mysql/2.2.0.sql:
##########
@@ -44,7 +44,10 @@ alter table `t_flink_sql`
     add column `team_resource` varchar(64) default null;
 
 alter table `t_flink_cluster`
-    add column `job_manager_url` varchar(150) default null comment 'url address of jobmanager' after `address`;
+    add column `job_manager_url` varchar(150) default null comment 'url address of jobmanager' after `address`,
+    add column `start_time` datetime default null comment 'start time',
+    add column `end_time` datetime default null comment 'end time',
+    add column `alert_id` bigint default null comment 'alert id';

Review Comment:
   I believe that these three statements are all related to the granularity of the `flink cluster` alarm, so I do not think it is necessary to divide them into multiple statements. If I am wrong, please correct me. CC @wolfboys 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@streampark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [incubator-streampark] RocMarshal commented on pull request #2809: [Feature][WIP][Service] flink cluster failure alarm&failover

Posted by "RocMarshal (via GitHub)" <gi...@apache.org>.
RocMarshal commented on PR #2809:
URL: https://github.com/apache/incubator-streampark/pull/2809#issuecomment-1599965265

   > Hi @wolfboys @RocMarshal, PTAL.
   
   @xujiangfeng001 Thanks for the contribution. I'll check it ASAP.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@streampark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [incubator-streampark] RocMarshal commented on a diff in pull request #2809: [Feature][WIP][Service] flink cluster failure alarm&failover

Posted by "RocMarshal (via GitHub)" <gi...@apache.org>.
RocMarshal commented on code in PR #2809:
URL: https://github.com/apache/incubator-streampark/pull/2809#discussion_r1242310171


##########
streampark-console/streampark-console-service/src/main/assembly/script/upgrade/pgsql/2.2.0.sql:
##########
@@ -57,7 +57,10 @@ alter table "public"."t_flink_sql"
     add column "team_resource" varchar(64) default null;
 
 alter table "public"."t_flink_cluster"
-    add column "job_manager_url" varchar(150) collate "pg_catalog"."default";
+    add column "job_manager_url" varchar(150) collate "pg_catalog"."default",
+    add column "start_time" timestamp(6) collate "pg_catalog"."default",
+    add column "end_time" timestamp(6) collate "pg_catalog"."default",
+    add column "alert_id" int8 collate "pg_catalog"."default";

Review Comment:
   Do we need to split it into two statements based on PR granularity here?
   CC @wolfboys 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@streampark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [incubator-streampark] RocMarshal commented on a diff in pull request #2809: [Feature][WIP][Service] flink cluster failure alarm&failover

Posted by "RocMarshal (via GitHub)" <gi...@apache.org>.
RocMarshal commented on code in PR #2809:
URL: https://github.com/apache/incubator-streampark/pull/2809#discussion_r1245054067


##########
streampark-console/streampark-console-service/src/main/resources/mapper/core/ApplicationMapper.xml:
##########
@@ -133,6 +133,14 @@
              limit 1
     </select>
 
+    <select id="getJobByClusterId" resultType="java.lang.Integer" parameterType="java.lang.Long">
+        SELECT
+            count(1)
+        FROM t_flink_app
+        WHERE flink_cluster_id = #{clusterId}
+            limit 1
+    </select>
+

Review Comment:
   That sounds good~



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@streampark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [incubator-streampark] wolfboys commented on pull request #2809: [ISSUE-2423][Feature][WIP] flink cluster failure alarm&failover

Posted by "wolfboys (via GitHub)" <gi...@apache.org>.
wolfboys commented on PR #2809:
URL: https://github.com/apache/incubator-streampark/pull/2809#issuecomment-1615798557

   Overall it looks good, I'll merged this pr first, there are still some minor problems, We can re-submit pr for improvement.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@streampark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [incubator-streampark] wolfboys merged pull request #2809: [ISSUE-2423][Feature][WIP] flink cluster failure alarm&failover

Posted by "wolfboys (via GitHub)" <gi...@apache.org>.
wolfboys merged PR #2809:
URL: https://github.com/apache/incubator-streampark/pull/2809


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@streampark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org