Posted to dev@gobblin.apache.org by "ASF GitHub Bot (Jira)" <ji...@apache.org> on 2021/08/09 21:31:00 UTC

[jira] [Work logged] (GOBBLIN-1509) Ensure flows transition to FAILED and not stuck in COMPILED upon DagManager::addDag error

     [ https://issues.apache.org/jira/browse/GOBBLIN-1509?focusedWorklogId=636105&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-636105 ]

ASF GitHub Bot logged work on GOBBLIN-1509:
-------------------------------------------

                Author: ASF GitHub Bot
            Created on: 09/Aug/21 21:30
            Start Date: 09/Aug/21 21:30
    Worklog Time Spent: 10m 
      Work Description: phet opened a new pull request #3357:
URL: https://github.com/apache/gobblin/pull/3357


   Dear Gobblin maintainers,
   
   Please accept this PR. I understand that it will not be reviewed until I have checked off all the steps below!
   
   
   ### JIRA
   - [ ] My PR addresses the following [Gobblin JIRA](https://issues.apache.org/jira/browse/GOBBLIN/) issues and references them in the PR title. For example, "[GOBBLIN-XXX] My Gobblin PR"
   - [ ] 
       - https://issues.apache.org/jira/browse/GOBBLIN-1509
   
   ### Description
   - [ ] Here are some details about my PR, including screenshots (if applicable):
   
   Announce flow failure on DagManager::addDag error
   
   Additionally, migrate Orchestrator as a whole away from the deprecated `EventSubmitter::getTimingEvent` factory method.
   
   Presently, addDag failure leaves the flow marooned in the COMPILED state, as the warranted FLOW_FAILED event is never sent.  Particularly insidious is that scheduled flows with their execution stuck in COMPILED miss their next execution, unless `flow.allowConcurrentExecutions` is set.  Thus the scheduled flow is stuck in its entirety, not merely a single execution.
   
   One observed cause of addDag failure is when the DagStateStore is backed by a replicated DB (e.g. MySqlDagStateStore) that just switched leaders.  Cached connections in the pool may suddenly point to a read-only follower unable to DagStateStore::writeCheckpoint.
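   
   As a rough illustration of the behavior described above, here is a minimal, self-contained sketch. It is not Gobblin's actual code: `FlowFailureSketch`, `DagAdder`, `StatusLog`, `orchestrate`, and the `RUNNING` status are all hypothetical stand-ins. Only the COMPILED and FAILED state names and the "Failed to add Job Execution Plan due to:" message prefix are taken from this PR. The idea is simply that an `IOException` from adding the DAG is caught and turned into a failure announcement, rather than leaving COMPILED as the last reported state.
   
   ```java
   import java.io.IOException;
   import java.util.ArrayList;
   import java.util.List;

   // Hypothetical sketch only: none of these types are Gobblin's real classes.
   class FlowFailureSketch {

     /** Stand-in flow states; RUNNING is an assumption made for illustration. */
     enum FlowStatus { COMPILED, RUNNING, FAILED }

     /** Stand-in for the component that accepts a compiled DAG and may fail to persist it. */
     interface DagAdder {
       void addDag(String flowName) throws IOException;
     }

     /** Records every announced status so the final outcome can be inspected. */
     static final class StatusLog {
       final List<String> messages = new ArrayList<>();
       FlowStatus latest;

       void emit(FlowStatus status, String message) {
         this.latest = status;
         this.messages.add(status + ": " + message);
       }
     }

     /** Announces COMPILED, then either RUNNING or, if adding the DAG throws, FAILED. */
     static FlowStatus orchestrate(String flowName, DagAdder adder, StatusLog log) {
       log.emit(FlowStatus.COMPILED, flowName + " compiled");
       try {
         adder.addDag(flowName);
         log.emit(FlowStatus.RUNNING, flowName + " handed off for execution");
       } catch (IOException e) {
         // Without this announcement, COMPILED would remain the last reported state.
         log.emit(FlowStatus.FAILED, "Failed to add Job Execution Plan due to: " + e.getMessage());
       }
       return log.latest;
     }

     public static void main(String[] args) {
       StatusLog healthy = new StatusLog();
       System.out.println(orchestrate("flowA", name -> { }, healthy));

       // Simulate a state store that rejects writes, e.g. a read-only follower.
       StatusLog broken = new StatusLog();
       System.out.println(orchestrate("flowB", name -> {
         throw new IOException("state store is read-only");
       }, broken));
       broken.messages.forEach(System.out::println);
     }
   }
   ```
   
   The second call in `main` mirrors the manual test below: the stub always throws, and the flow's final status carries the exception message instead of staying at COMPILED.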
   
   ### Tests
   - [ ] My PR adds the following unit tests __OR__ does not need testing for this extremely good reason:
   
   Manual testing while running GaaS locally:
   1. I added a flow with `runImmediately` set, which I soon after observed as COMPLETE.
   2. I then locally patched `DagManager::addDag` to mimic the motivating failure scenario by invariably throwing an `IOException`.
   3. I again added the same flow as in step 1 (adjusted only to bear a unique name and target location) with `runImmediately` set.
   4. I observed that second flow as FAILED, with the shimmed exception conveyed in the message.
   
   a. patched `DagManager`:
   ```
   --- a/gobblin-service/src/main/java/org/apache/gobblin/service/modules/orchestration/DagManager.java
   +++ b/gobblin-service/src/main/java/org/apache/gobblin/service/modules/orchestration/DagManager.java
   @@ -263,7 +263,8 @@ public class DagManager extends AbstractIdleService {
      synchronized void addDag(Dag<JobExecutionPlan> dag, boolean persist, boolean setStatus) throws IOException {
        if (persist) {
          //Persist the dag
   -      this.dagStateStore.writeCheckpoint(dag);
   +      throw new IOException("No, I won't add the DAG, dawg!");
   +      // this.dagStateStore.writeCheckpoint(dag);
        }
        int queueId = DagManagerUtils.getDagQueueId(dag, this.numThreads);
        // Add the dag to the specific queue determined by flowExecutionId
   ```
   
   b. submitted flow:
   ```
   {
     "id": {
       "flowName": "test005",
       "flowGroup": "testKip"
     },
     "templateUris": "FS:///",
     "properties": {
       "gobblin.flow.sourceIdentifier": "<<redacted-source>>",
       "gobblin.flow.destinationIdentifier": "<<(same) redacted-source>>",
       "user.to.proxy": "gobblintest",
       "gobblin.flow.input.dataset.descriptor.path": "<<redacted-path>>",
       "gobblin.flow.output.dataset.descriptor.path": "/tmp/gaas-testing/kip/test005",
       "gobblin.flow.input.dataset.descriptor.partition.type": "none",
       "gobblin.flow.output.dataset.descriptor.partition.type": "none",
       "gobblin.copy.simulate": "false",
       "flow.applyRetention": "false",
       "dataset.datetimePattern": "yyyy/MM/dd",
       "copy.date.pattern": "yyyy/MM/dd"
     },
     "schedule": {
       "cronSchedule": "0 0 8 * * ? *",
       "runImmediately": true
     }
   }
   ```
   
   c.  observed (`FAILED`) status:
   ```
   curli -k --dv-auth SELF "https://localhost:6956/sharedgobblinservice/flowexecutions?q=latestFlowExecution&flowId=(flowGroup:testKip,flowName:test005)" -X GET -H 'X-RestLi-Protocol-Version: 2.0.0'
     % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                    Dload  Upload   Total   Spent    Left  Speed
   100   361    0   361    0     0     44      0 --:--:--  0:00:08 --:--:--    74
   {
       "elements": [
           {
               "id": {
                   "flowGroup": "testKip",
                   "flowExecutionId": 1628209427967,
                   "flowName": "test005"
               },
               "message": "Failed to add Job Execution Plan due to: No, I won't add the DAG, dawg!",
               "executionStatistics": {
                   "executionEndTime": 1628209428045,
                   "executionStartTime": 1628209427967
               },
               "jobStatuses": [],
               "executionStatus": "FAILED"
           }
       ],
       "paging": {
           "count": 10,
           "start": 0,
           "links": []
       }
   }
   ```
   
   ### Commits
   - [ ] My commits all reference JIRA issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "[How to write a good git commit message](http://chris.beams.io/posts/git-commit/)":
       1. Subject is separated from body by a blank line
       2. Subject is limited to 50 characters
       3. Subject does not end with a period
       4. Subject uses the imperative mood ("add", not "adding")
       5. Body wraps at 72 characters
       6. Body explains "what" and "why", not "how"
   
   



Issue Time Tracking
-------------------

            Worklog Id:     (was: 636105)
    Remaining Estimate: 0h
            Time Spent: 10m

> Ensure flows transition to FAILED and not stuck in COMPILED upon DagManager::addDag error
> -----------------------------------------------------------------------------------------
>
>                 Key: GOBBLIN-1509
>                 URL: https://issues.apache.org/jira/browse/GOBBLIN-1509
>             Project: Apache Gobblin
>          Issue Type: Bug
>            Reporter: Kip Kohn
>            Priority: Major
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Presently, addDag failure leaves the flow marooned in the COMPILED state, as the warranted FLOW_FAILED event is never sent.  Particularly insidious is that scheduled flows with their execution stuck in COMPILED miss their next execution, unless `flow.allowConcurrentExecutions` is set.  Thus the scheduled flow is stuck in its entirety, not merely a single execution.
> One observed cause of addDag failure is when the DagStateStore is backed by a replicated DB (e.g. MySqlDagStateStore) that just switched leaders. Cached connections in the pool may suddenly point to a read-only follower unable to DagStateStore::writeCheckpoint.


