You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@ambari.apache.org by Jonathan Hurley <jh...@hortonworks.com> on 2017/02/02 16:18:02 UTC

Re: Review Request 53686: Stage and Request status should be persisted in the database


> On Dec. 15, 2016, 11:49 a.m., Jonathan Hurley wrote:
> > ambari-server/src/main/java/org/apache/ambari/server/controller/internal/CalculatedStatus.java, lines 282-287
> > <https://reviews.apache.org/r/53686/diff/5/?file=1584766#file1584766line282>
> >
> >     Would this count states like FAILED / TIMEDOUT twice?
> 
> Jaimin Jetly wrote:
>     yes. It will increment the FAILED counter and also increment COMPLETED counter which is desired.

Why is this desired? Double counting them doesn't seem right.


> On Dec. 15, 2016, 11:49 a.m., Jonathan Hurley wrote:
> > ambari-server/src/main/java/org/apache/ambari/server/actionmanager/ActionDBAccessorImpl.java, lines 445-447
> > <https://reviews.apache.org/r/53686/diff/5/?file=1584761#file1584761line445>
> >
> >     This could be problematic if the merge fails when the transaction commits. You've already fired the event.
> 
> Jaimin Jetly wrote:
>     persistAction method has been annotated with @Transactional and @TransactionalLock.
>     I believe if the publisher method is marked to be transactional then the event listener will also fall in that transactional boundary considering that the eventbus being used over here is default which I believe is synchronous. 
>     
>     I will verify this myself by checking that publsiher method, eventbus and subscriber method are running on same thread.
> 
> Jaimin Jetly wrote:
>     I verified that the publisher and subscriber code runs on same thread and so transactional boundary will extend for the entire logic making sure that tasks and respective stage/request are consistent in their status

Although it may be true that the event bus is synchronous here, the point is that the merge() which comes before it doesn't necessarily count until the entire transaction is committed. By firing the event in this code, you're assuming that the merge() will succeed once the batches operations are flushed and the transaction is commited. If the listener modifies any sort of in-memory state, then this causes a data integrity problem.


On Dec. 15, 2016, 11:49 a.m., Jaimin Jetly wrote:
> > One thing I don't quite see here (and it could be due to the size of the patch) is what happens in these two cases:
> > - Something goes wrong when trying to store a task's status. How does the system recover and mark it completed?
> > - What about waiting until a request is HOLDING and then restarting Ambari - will the relevent maps get re-populated?
> 
> Jaimin Jetly wrote:
>     >> - Something goes wrong when trying to store a task's status. How does the system recover and mark it completed?
>     
>     This work only adds logic to add/update stage and request status. The way task status is being updated or the logic for system to recover from anything that goes wrong when storing task status is not changed.
>     This work ensures that task status update, respective stage status update and respective request status update happens inside same transactional boundary. Thus all three entities remains consistent in the status they show. 
>     This work does not add any recovery logic and piggybacks on existing failure recovery mechanism for updating task status. Thus if something goes wrong storing task status then stage/request status will also be not store and vice-versa. Next time when ambari-agent sends command reports again then task update and respective stage/request status update should also also get updated successfully. 
>     
>     >> -  What about waiting until a request is HOLDING and then restarting Ambari - will the relevent maps get re-populated?
>     
>     Yes, everytime ambari-server starts, we check for all stages in HostRoleStatus.IN_PROGRESS_STATUSES and publish an event with the tasks. these will repopulate the maps. The patch adds that logic with "publishInProgressTasks(stages)" method in ActionScheduler.java
>     
>     I have tested that scenario of restarting ambari-server when a request is ongoing with in progress tasks and validating that stages and requests status is correctly updated
> 
> Sid Wagle wrote:
>     The agent re-sending command reports are we generally fault tolerant in that area? Sicne you are probably mostly up-to-date on that part of the code can you shed some light on loss of task status scenario. Do we recover the correct status?
> 
> Jaimin Jetly wrote:
>     Yes. Largely. 
>     Although There are following specific cases:
>     1) If persisting of "request, it's stages and it's host role commands" for the first time does not happen then the request will be completely missed and ambari-server will never log in the database or track it in anyways.
>     2) If "request, it's stages and it's host role commands" are persisted but subsequent updates and DB merge keeps failing then request will be eventually declared timeout and so will all hrcs
>     
>     As far as DB merge from subsequent agent command reports is successfully committed to DB, earlier failures will be recovered.

But agent status / heartbeat processing is asynchronous - so once we receive the status and try to process it, the agent has already delivered its payload. So, if there's a failure processing the completed status, it will never be marked as completed. How does it recover in this scenario - how does the correct status get updated in the DB since the agent doesn't re-send a delivered command status?


- Jonathan


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/53686/#review159315
-----------------------------------------------------------


On Jan. 18, 2017, 5:33 p.m., Jaimin Jetly wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/53686/
> -----------------------------------------------------------
> 
> (Updated Jan. 18, 2017, 5:33 p.m.)
> 
> 
> Review request for Ambari, Jonathan Hurley, Nate Cole, Sumit Mohanty, and Sid Wagle.
> 
> 
> Bugs: AMBARI-18868
>     https://issues.apache.org/jira/browse/AMBARI-18868
> 
> 
> Repository: ambari
> 
> 
> Description
> -------
> 
> Stage and Request status should be persisted in the database.
> 
> upgrading to ambari-3.0.0 should add status for all present stages and request for the cluster.
> 
> 
> Diffs
> -----
> 
>   ambari-server/src/main/java/org/apache/ambari/server/actionmanager/ActionDBAccessorImpl.java 7837a7b 
>   ambari-server/src/main/java/org/apache/ambari/server/actionmanager/ActionScheduler.java dabcb98 
>   ambari-server/src/main/java/org/apache/ambari/server/actionmanager/HostRoleStatus.java 3656bfe 
>   ambari-server/src/main/java/org/apache/ambari/server/actionmanager/Request.java 31e11c1 
>   ambari-server/src/main/java/org/apache/ambari/server/actionmanager/Stage.java 4a05b32 
>   ambari-server/src/main/java/org/apache/ambari/server/controller/internal/CalculatedStatus.java 3c415df 
>   ambari-server/src/main/java/org/apache/ambari/server/events/TaskCreateEvent.java PRE-CREATION 
>   ambari-server/src/main/java/org/apache/ambari/server/events/TaskEvent.java PRE-CREATION 
>   ambari-server/src/main/java/org/apache/ambari/server/events/TaskUpdateEvent.java PRE-CREATION 
>   ambari-server/src/main/java/org/apache/ambari/server/events/listeners/tasks/TaskStatusListener.java PRE-CREATION 
>   ambari-server/src/main/java/org/apache/ambari/server/events/publishers/TaskEventPublisher.java PRE-CREATION 
>   ambari-server/src/main/java/org/apache/ambari/server/orm/dao/HostRoleCommandDAO.java 02c4091 
>   ambari-server/src/main/java/org/apache/ambari/server/orm/dao/RequestDAO.java 1c4d0a3 
>   ambari-server/src/main/java/org/apache/ambari/server/orm/dao/StageDAO.java d2f899f 
>   ambari-server/src/main/java/org/apache/ambari/server/orm/entities/HostRoleCommandEntity.java 74271b9 
>   ambari-server/src/main/java/org/apache/ambari/server/orm/entities/RequestEntity.java 7944d21 
>   ambari-server/src/main/java/org/apache/ambari/server/orm/entities/StageEntity.java f9c8810 
>   ambari-server/src/main/java/org/apache/ambari/server/orm/entities/StageEntityPK.java 9ca0470 
>   ambari-server/src/main/java/org/apache/ambari/server/upgrade/UpgradeCatalog300.java 4f90ef3 
>   ambari-server/src/main/resources/Ambari-DDL-Derby-CREATE.sql b79c945 
>   ambari-server/src/main/resources/Ambari-DDL-MySQL-CREATE.sql 1c502bc 
>   ambari-server/src/main/resources/Ambari-DDL-Oracle-CREATE.sql c6d4ad0 
>   ambari-server/src/main/resources/Ambari-DDL-Postgres-CREATE.sql 1be87bb 
>   ambari-server/src/main/resources/Ambari-DDL-SQLAnywhere-CREATE.sql abe48e8 
>   ambari-server/src/main/resources/Ambari-DDL-SQLServer-CREATE.sql 169a464 
>   ambari-server/src/test/java/org/apache/ambari/server/actionmanager/TestActionDBAccessorImpl.java 1ca777d 
>   ambari-server/src/test/java/org/apache/ambari/server/actionmanager/TestActionScheduler.java 6cc511e 
>   ambari-server/src/test/java/org/apache/ambari/server/controller/internal/UpgradeResourceProviderTest.java a702e6f 
>   ambari-server/src/test/java/org/apache/ambari/server/controller/internal/UpgradeSummaryResourceProviderTest.java e398a54 
>   ambari-server/src/test/java/org/apache/ambari/server/events/listeners/tasks/TaskStatusListenerTest.java PRE-CREATION 
>   ambari-server/src/test/java/org/apache/ambari/server/orm/dao/UpgradeDAOTest.java ae85241 
>   ambari-server/src/test/java/org/apache/ambari/server/state/services/RetryUpgradeActionServiceTest.java 7dd9932 
>   ambari-server/src/test/java/org/apache/ambari/server/upgrade/UpgradeCatalog300Test.java d7979e8 
> 
> Diff: https://reviews.apache.org/r/53686/diff/
> 
> 
> Testing
> -------
> 
> Verified manually on a cluster by making api requests and upgrading ambari.
> Add unit tests.
> Verified that the patch does not break any existing unit tests on dev box. Jenkins job overall unit test result pending..
> Verified on 1000 node cluster that the patch does not regress big operations. Executed Stop Services and Start Services API call which gernerated around 9000 tasks and compared  request completion time for these operations. There was a minor performance gain with the patch. As part of https://issues.apache.org/jira/browse/AMBARI-18889, I will look if we can use the request status and stage status to further enhance performance.
> 
> 
> Thanks,
> 
> Jaimin Jetly
> 
>