Posted to dev@falcon.apache.org by "Balu Vellanki (JIRA)" <ji...@apache.org> on 2016/01/14 01:30:40 UTC

[jira] [Commented] (FALCON-141) Support cluster updates

    [ https://issues.apache.org/jira/browse/FALCON-141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15097354#comment-15097354 ] 

Balu Vellanki commented on FALCON-141:
--------------------------------------

Here is the full summary of the discussion above; please review and comment. I will start implementing cluster update if there are no further comments.

In most real-world scenarios, updating a cluster from non-HA to HA will require the following steps.
1. Shut down Falcon.
2. Update the Hadoop cluster from non-HA to HA (or non-secure to secure).
3. Update Falcon configs such as startup.properties.
4. Start Falcon.

The initial solution was to update the cluster entity after step 4, in a manner similar to updating a feed/process entity. This approach will not work because, when Falcon is started, SharedLibraryHostingService needs to connect to the HDFS of each cluster and copy the relevant jars to <cluster_working_dir>/lib/. Since the cluster entity still has the old values for the read/write endpoints, Falcon will fail at this step and will not start. The solution is to start Falcon in a safe mode, where Falcon can start without having to access the cluster entity's HDFS location.
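
A minimal sketch of what such a safe-mode gate could look like during service startup is below. The property name "falcon.safemode" and the FalconService interface are assumptions made for illustration only, not Falcon's actual API.

{code:java}
import java.util.List;
import java.util.Properties;

// Hypothetical sketch of a safe-mode gate in the Falcon startup sequence.
// Property and type names are illustrative, not Falcon's actual API.
public final class SafeModeStartup {

    /** Simplified stand-in for a Falcon startup service. */
    public interface FalconService {
        void start() throws Exception;
        /** True for services such as SharedLibraryHostingService that must reach cluster HDFS. */
        boolean requiresClusterAccess();
    }

    // Assumed safe-mode switch read from startup.properties (hypothetical property name).
    private static boolean isSafeMode(Properties startupProperties) {
        return Boolean.parseBoolean(startupProperties.getProperty("falcon.safemode", "false"));
    }

    public static void startServices(Properties startupProperties, List<FalconService> services)
            throws Exception {
        boolean safeMode = isSafeMode(startupProperties);
        for (FalconService service : services) {
            // In safe mode, skip services that need the cluster entity's HDFS endpoints,
            // so Falcon can come up even though the stored endpoints are stale.
            if (safeMode && service.requiresClusterAccess()) {
                continue;
            }
            service.start();
        }
    }
}
{code}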

Proposed Solution:
------------------

# Stop Falcon.
# Update the Hadoop cluster from non-HA to HA (or non-secure to secure).
# Start Falcon in safe mode. In this mode:
#* Falcon starts without starting SharedLibraryHostingService.
#* SuperUsers of Falcon (as specified in startup.properties) can update the Falcon cluster entity.
#* No other write operations are allowed in safe mode.
# Update the cluster entity in Falcon.
#* In this step, a SuperUser can update the existing cluster entity. The following fields can be updated by a SuperUser:
#** Cluster description, tags: Falcon will update the graphDB with the new description and tags.
#** All interfaces: the underlying read/write and workflow interfaces should remain the same even if the url:port has changed.
#** All locations: Falcon will validate the new locations.
#** Properties
#* Get a lock on the cluster entity in the ConfigStore.
#* If any interface, location or property is updated, the feeds and processes dependent on this cluster should be updated in the workflow engine. This should be done once Falcon is started in normal mode, so add a "requireUpdate" flag to each entity in the ConfigStore to indicate that the feed/process entity should be updated in the workflow engine (see the sketch after this list).
#* Commit the new cluster entity to the ConfigStore after validating the entity.
# Restart Falcon in normal mode. This requires the following tasks to be added to the existing Falcon startup:
#* Start SharedLibraryHostingService along with the other services.
#* Update the coordinator/bundle of all feed/process entities that were flagged during the cluster entity update. The user of the coord/bundle should be the same as the previous owner of the coord/bundle.
# Handling failures in cluster update:
#* If the cluster entity update fails, throw an exception to the SuperUser with the reason. The SuperUser will have to fix the issue and retry the cluster entity update.
#* The "requireUpdate" flag is necessary because UpdateHelper::shouldUpdate() would compare the old entity with the new entity and return true for all dependent entities. If the update fails for some dependent entities, a retry should only happen on the failed dependent entities. Having a requireUpdate flag helps solve this.
#* If the update of the coord/bundle of a dependent entity fails,
#** Retain the requireUpdate flag on the entity.
#** Continue updating the bundle/coord of the remaining feed/process entities.
#** At the end of Falcon startup, show a warning to the user with the list of entities that could not be updated.
#** Provide the ability to update the bundle/coord of flagged entities individually.
# Handling re-run of past instances:
#* The question is whether a succeeded instance should be rerun using the old coordinator or the new coordinator with the updated HA/security settings. For this version of cluster update, rerunning instances that ran before the update using the old coordinator will not succeed.
#* The solution is to create new coordinators only when the JT/RM or NN interfaces change, and otherwise update the existing entities so that reruns are handled correctly in those cases. Whenever the JT/RM or NN endpoints change, we can issue a warning saying that, once the update is processed, reruns of instances prior to the update cannot be handled.
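
The update and restart flow above is summarized in the rough sketch below. The ConfigStore, Entity and WorkflowEngine types are simplified stand-ins introduced for illustration; only the "requireUpdate" flag and the ordering of steps come from the proposal itself.

{code:java}
import java.util.ArrayList;
import java.util.List;

// Rough sketch of the proposed cluster-update flow. All types here are simplified
// stand-ins, not Falcon's real classes.
public final class ClusterUpdateSketch {

    /** Minimal stand-in for a feed/process entity carrying the proposed requireUpdate flag. */
    public interface Entity {
        void setRequireUpdate(boolean flag);
        boolean requiresUpdate();
    }

    /** Minimal stand-in for Falcon's config store. */
    public interface ConfigStore {
        void lock(String clusterName);
        void unlock(String clusterName);
        List<Entity> getDependents(String clusterName);
        List<Entity> getFlaggedEntities();
        void save(Entity entity);
        void commitCluster(String clusterName, String newDefinition);
    }

    /** Minimal stand-in for the workflow engine (Oozie). */
    public interface WorkflowEngine {
        void updateCoordAndBundle(Entity entity) throws Exception;
    }

    /** Safe mode: a SuperUser updates the cluster entity and flags its dependents. */
    public static void updateCluster(ConfigStore store, String clusterName,
                                     String newDefinition, boolean endpointsChanged) {
        store.lock(clusterName);
        try {
            // Validation of interfaces, locations and properties would happen here.
            if (endpointsChanged) {
                // Flag every dependent feed/process so its coord/bundle is refreshed
                // once Falcon restarts in normal mode.
                for (Entity dependent : store.getDependents(clusterName)) {
                    dependent.setRequireUpdate(true);
                    store.save(dependent);
                }
            }
            store.commitCluster(clusterName, newDefinition);  // commit only after validation succeeds
        } finally {
            store.unlock(clusterName);
        }
    }

    /** Normal restart: refresh the coord/bundle of every flagged entity; keep the flag on failure. */
    public static List<Entity> refreshFlaggedEntities(ConfigStore store, WorkflowEngine engine) {
        List<Entity> failed = new ArrayList<>();
        for (Entity entity : store.getFlaggedEntities()) {
            try {
                engine.updateCoordAndBundle(entity);  // submitted as the previous owner of the coord/bundle
                entity.setRequireUpdate(false);       // clear the flag only on success
                store.save(entity);
            } catch (Exception e) {
                failed.add(entity);                   // flag is retained so a later retry targets only failures
            }
        }
        return failed;  // reported to the user as a warning at the end of startup
    }
}
{code}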

Alternate Solution Proposed:
----------------------------
Srikanth proposed an alternate solution that would do the following:
# Stop Falcon and Oozie.
# Update the Hadoop cluster from non-HA to HA (or non-secure to secure).
# Run a standalone tool that would
#* Update the cluster entity definition directly in the Falcon config store.
#* Update the coordinators and bundles directly in the Oozie database (could be mysql or postgres or…).
# Start Falcon and Oozie.

The advantage of this approach is that it ensures that re-runs of past instances will succeed. The disadvantages are:
# The tool is standalone, so it will not be possible to do a rolling upgrade of Falcon from one version to another.
# This solution would require hand-holding of Falcon users.
# This requires tinkering with entries in the Oozie (or the workflow scheduler) database.

Due to the above-mentioned reasons, we decided against using the alternate solution.

> Support cluster updates
> -----------------------
>
>                 Key: FALCON-141
>                 URL: https://issues.apache.org/jira/browse/FALCON-141
>             Project: Falcon
>          Issue Type: Bug
>            Reporter: Shwetha G S
>            Assignee: Balu Vellanki
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)