Posted to commits@falcon.apache.org by ve...@apache.org on 2014/08/08 19:43:46 UTC

[7/9] git commit: FALCON-468 Add User Documentation for authorization feature. Contributed by Venkatesh Seetharam

FALCON-468 Add User Documentation for authorization feature. Contributed by Venkatesh Seetharam


Project: http://git-wip-us.apache.org/repos/asf/incubator-falcon/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-falcon/commit/7b3e5107
Tree: http://git-wip-us.apache.org/repos/asf/incubator-falcon/tree/7b3e5107
Diff: http://git-wip-us.apache.org/repos/asf/incubator-falcon/diff/7b3e5107

Branch: refs/heads/master
Commit: 7b3e51079c6ac39808130e0a1df2e6adeb55db07
Parents: adca005
Author: Venkatesh Seetharam <ve...@apache.org>
Authored: Fri Aug 8 10:17:35 2014 -0700
Committer: Venkatesh Seetharam <ve...@apache.org>
Committed: Fri Aug 8 10:22:40 2014 -0700

----------------------------------------------------------------------
 docs/src/site/twiki/EntitySpecification.twiki |  82 +++++++---
 docs/src/site/twiki/FalconDocumentation.twiki |   5 +
 docs/src/site/twiki/Security.twiki            | 166 ++++++++++++++++++++-
 docs/src/site/twiki/index.twiki               |   2 +-
 4 files changed, 231 insertions(+), 24 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-falcon/blob/7b3e5107/docs/src/site/twiki/EntitySpecification.twiki
----------------------------------------------------------------------
diff --git a/docs/src/site/twiki/EntitySpecification.twiki b/docs/src/site/twiki/EntitySpecification.twiki
index d387c9c..4eb4b17 100644
--- a/docs/src/site/twiki/EntitySpecification.twiki
+++ b/docs/src/site/twiki/EntitySpecification.twiki
@@ -17,7 +17,9 @@ The colo specifies the colo to which this cluster belongs to and name is the nam
 be unique.
 
 
-A cluster has varies interfaces as described below:
+---+++ Interfaces
+
+A cluster has various interfaces as described below:
 <verbatim>
     <interface type="readonly" endpoint="hftp://localhost:50010" version="0.20.2" />
 </verbatim>
@@ -56,15 +58,32 @@ Although Hive metastore supports both RPC and HTTP, Falcon comes with an impleme
 </verbatim>
 A messaging interface specifies the interface for sending feed availability messages, it's endpoint is broker url with tcp address.
 
+---+++ Locations
+
 A cluster has a list of locations defined:
 <verbatim>
 <location name="staging" path="/projects/falcon/staging" />
+<location name="working" path="/projects/falcon/working" />
 </verbatim>
 Location has the name and the path, name is the type of locations like staging, temp and working.
 and path is the hdfs path for each location.
 Falcon would use the location to do intermediate processing of entities in hdfs and hence Falcon
 should have read/write/execute permission on these locations.
 
+---+++ ACL
+
+A cluster has an ACL (Access Control List), which is useful for implementing permission
+requirements and provides a way to set different permissions for specific users or named groups.
+<verbatim>
+    <ACL owner="test-user" group="test-group" permission="*"/>
+</verbatim>
+ACL indicates the access control list for this cluster.
+owner is the owner of this entity.
+group is the group that has read access.
+permission indicates the permission.
+
+---+++ Custom Properties
+
 A cluster has a list of properties:
 A key-value pair, which are propagated to the workflow engine.
 <verbatim>
@@ -217,7 +236,19 @@ upto 8 hours then late-arrival's cut-off="hours(8)"
 
 *Note:* This will only apply for !FileSystem storage but not Table storage until a future time.
 
----++++ Custom Properties
+---+++ ACL
+
+A feed has an ACL (Access Control List), which is useful for implementing permission
+requirements and provides a way to set different permissions for specific users or named groups.
+<verbatim>
+    <ACL owner="test-user" group="test-group" permission="*"/>
+</verbatim>
+ACL indicates the access control list for this feed.
+owner is the owner of this entity.
+group is the group that has read access.
+permission indicates the permission.
+
+---+++ Custom Properties
 
 <verbatim>
     <properties>
@@ -240,7 +271,7 @@ waiting for the feed instance and parallel decides the concurrent replication in
 A process defines configuration for a workflow. A workflow is a directed acyclic graph(DAG) which defines the job for the workflow engine. A process definition defines  the configurations required to run the workflow job. For example, process defines the frequency at which the workflow should run, the clusters on which the workflow should run, the inputs and outputs for the workflow, how the workflow failures should be handled, how the late inputs should be handled and so on.  
 
 The different details of process are:
----++++ Name
+---+++ Name
 Each process is identified with a unique name.
 Syntax:
 <verbatim>
@@ -249,7 +280,7 @@ Syntax:
 </process>
 </verbatim>
 
----++++ Cluster
+---+++ Cluster
 The cluster on which the workflow should run. A process should contain one or more clusters. Cluster definition for the cluster name gives the end points for workflow execution, name node, job tracker, messaging and so on. Each cluster inturn has validity mentioned, which tell the times between which the job should run on that specified cluster. 
 Syntax:
 <verbatim>
@@ -270,7 +301,7 @@ Syntax:
 </process>
 </verbatim>
 
----++++ Parallel
+---+++ Parallel
 Parallel defines how many instances of the workflow can run concurrently. It should be a positive integer > 0.
 For example, parallel of 1 ensures that only one instance of the workflow can run at a time. The next instance will start only after the running instance completes.
 Syntax:
@@ -282,7 +313,7 @@ Syntax:
 </process>
 </verbatim>
 
----++++ Order
+---+++ Order
 Order defines the order in which the ready instances are picked up. The possible values are FIFO(First In First Out), LIFO(Last In First Out), and ONLYLAST(Last Only).
 Syntax:
 <verbatim>
@@ -293,7 +324,7 @@ Syntax:
 </process>
 </verbatim>
 
----++++ Timeout
+---+++ Timeout
 A optional Timeout specifies the maximum time an instance waits for a dataset before being killed by the workflow engine, a time out is specified like frequency.
 If timeout is not specified, falcon computes a default timeout for a process based on its frequency, which is six times of the frequency of process or 30 minutes if computed timeout is less than 30 minutes.
 <verbatim>
@@ -304,7 +335,7 @@ If timeout is not specified, falcon computes a default timeout for a process bas
 </process>
 </verbatim>
 
----++++ Frequency
+---+++ Frequency
 Frequency defines how frequently the workflow job should run. For example, hours(1) defines the frequency as hourly, days(7) defines weekly frequency. The values for timeunit can be minutes/hours/days/months and the frequency number should be a positive integer > 0. 
 Syntax:
 <verbatim>
@@ -315,7 +346,7 @@ Syntax:
 </process>
 </verbatim>
 
----++++ Validity
+---+++ Validity
 Validity defines how long the workflow should run. It has 3 components - start time, end time and timezone. Start time and end time are timestamps defined in yyyy-MM-dd'T'HH:mm'Z' format and should always be in UTC. Timezone is used to compute the next instances starting from start time. The workflow will start at start time and end before end time specified on a given cluster. So, there will not be a workflow instance at end time.
 Syntax:
 <verbatim>
@@ -347,7 +378,7 @@ The daily workflow will start on Jan 1st 2012 at 00:40 UTC, it will run at 40th
 </verbatim>
 The hourly workflow will start on March 11th 2012 at 00:40 PST, the next instances will be at 01:40 PST, 03:40 PDT, 04:40 PDT and so on till 23:40 PDT. So, there will be just 23 instances of the workflow for March 11th 2012 because of DST switch.
 
----++++ Inputs
+---+++ Inputs
 Inputs define the input data for the workflow. The workflow job will start executing only after the schedule time and when all the inputs are available. There can be 0 or more inputs and each of the input maps to a feed. The path and frequency of input data is picked up from feed definition. Each input should also define start and end instances in terms of [[FalconDocumentation][EL expressions]] and can optionally specify specific partition of input that the workflow requires. The components in partition should be subset of partitions defined in the feed.
 
 For each input, Falcon will create a property with the input name that contains the comma separated list of input paths. This property can be used in workflow actions like pig scripts and so on.
@@ -447,7 +478,7 @@ Example workflow configuration:
 </verbatim>
 
 
----++++ Optional Inputs
+---+++ Optional Inputs
 User can mention one or more inputs as optional inputs. In such cases the job does not wait on those inputs which are
 mentioned as optional. If they are present it considers them otherwise continue with the compulsory ones.
 Example:
@@ -477,7 +508,7 @@ Example:
 *Note:* This is only supported for !FileSystem storage but not Table storage at this point.
 
 
----++++ Outputs
+---+++ Outputs
 Outputs define the output data that is generated by the workflow. A process can define 0 or more outputs. Each output is mapped to a feed and the output path is picked up from feed definition. The output instance that should be generated is specified in terms of [[FalconDocumentation][EL expression]].
 
 For each output, Falcon creates a property with output name that contains the path of output data. This can be used in workflows to store in the path.
@@ -561,7 +592,7 @@ Example workflow configuration:
 </configuration>
 </verbatim>
 
----++++ Properties
+---+++ Custom Properties
 The properties are key value pairs that are passed to the workflow. These properties are optional and can be used
 in workflow to parameterize the workflow.
 Syntax:
@@ -582,7 +613,7 @@ queueName and jobPriority are special properties, which when present are used by
         <property name="jobPriority" value="VERY_HIGH"/>
 </verbatim>
 
----++++ Workflow
+---+++ Workflow
 
 The workflow defines the workflow engine that should be used and the path to the workflow on hdfs.
 The workflow definition on hdfs contains the actual job that should run and it should confirm to
@@ -594,7 +625,7 @@ be available for the workflow.
 
 There are 2 engines supported today.
 
----+++++ Oozie
+---++++ Oozie
 
 As part of oozie workflow engine support, users can embed a oozie workflow.
 Refer to oozie [[http://oozie.apache.org/docs/4.0.0/DG_Overview.html][workflow overview]] and
@@ -621,7 +652,7 @@ Example:
 This defines the workflow engine to be oozie and the workflow xml is defined at
 /projects/bootcamp/workflow/workflow.xml. The libraries are at /projects/bootcamp/workflow/lib.
 
----+++++ Pig
+---++++ Pig
 
 Falcon also adds the Pig engine which enables users to embed a Pig script as a process.
 
@@ -640,7 +671,7 @@ This defines the workflow engine to be pig and the pig script is defined at
 Feeds with Hive table storage will send one more parameter apart from the general ones:
 <verbatim>$input_filter</verbatim>
 
----+++++ Hive
+---++++ Hive
 
 Falcon also adds the Hive engine as part of Hive Integration which enables users to embed a Hive script as a process.
 This would enable users to create materialized queries in a declarative way.
@@ -660,7 +691,7 @@ This defines the workflow engine to be hive and the hive script is defined at
 Feeds with Hive table storage will send one more parameter apart from the general ones:
 <verbatim>$input_filter</verbatim>
 
----++++ Retry
+---+++ Retry
 Retry policy defines how the workflow failures should be handled. Two retry policies are defined: backoff and exp-backoff(exponential backoff). Depending on the delay and number of attempts, the workflow is re-tried after specific intervals.
 Syntax:
 <verbatim>
@@ -681,7 +712,7 @@ Examples:
 </verbatim>
 The workflow is re-tried after 10 mins, 20 mins and 30 mins. With exponential backoff, the workflow will be re-tried after 10 mins, 20 mins and 40 mins.
 
----++++ Late data
+---+++ Late data
 Late data handling defines how the late data should be handled. Each feed is defined with a late cut-off value which specifies the time till which late data is valid. For example, late cut-off of hours(6) means that data for nth hour can get delayed by upto 6 hours. Late data specification in process defines how this late data is handled.
 
 Late data policy defines how frequently check is done to detect late data. The policies supported are: backoff, exp-backoff(exponention backoff) and final(at feed's late cut-off). The policy along with delay defines the interval at which late data check is done.
@@ -724,3 +755,16 @@ Example:
 This late handling specifies that late data detection should run at feed's late cut-off which is 6 hours in this case. If there is late data, Falcon should run the workflow specified at /projects/bootcamp/workflow/lateinput1/workflow.xml
 
 *Note:* This is only supported for !FileSystem storage but not Table storage at this point.
+
+---+++ ACL
+
+A process has an ACL (Access Control List), which is useful for implementing permission
+requirements and provides a way to set different permissions for specific users or named groups.
+<verbatim>
+    <ACL owner="test-user" group="test-group" permission="*"/>
+</verbatim>
+ACL indicates the access control list for this process.
+owner is the owner of this entity.
+group is the group that has read access.
+permission indicates the permission.
+
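
Reviewer note on the ACL element documented above: the sketch below shows one way a client could read the owner, group and permission attributes out of an entity definition. It is a minimal illustration in Python using only the standard library, not Falcon code. The read_acl helper, the Acl tuple and the trimmed sample process are all hypothetical; only the ACL line itself is taken from the documentation in this diff.

```python
import xml.etree.ElementTree as ET
from typing import NamedTuple, Optional


class Acl(NamedTuple):
    """Hypothetical container for the three ACL attributes."""
    owner: str
    group: str
    permission: str


def read_acl(entity_xml: str) -> Optional[Acl]:
    """Return the ACL of an entity definition, or None if it has no ACL element."""
    root = ET.fromstring(entity_xml.strip())
    for child in root:
        # Drop an XML namespace prefix of the form "{uri}ACL" if one is present.
        local_name = child.tag.rsplit("}", 1)[-1]
        if local_name == "ACL":
            return Acl(
                owner=child.get("owner", ""),
                group=child.get("group", ""),
                permission=child.get("permission", ""),
            )
    return None


# Heavily trimmed, hypothetical process definition; only the ACL line is from the docs.
SAMPLE = """
<process name="sample-process">
    <ACL owner="test-user" group="test-group" permission="*"/>
</process>
"""

if __name__ == "__main__":
    print(read_acl(SAMPLE))
```

Matching on the local tag name keeps the helper usable whether or not the entity document declares a default XML namespace.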

http://git-wip-us.apache.org/repos/asf/incubator-falcon/blob/7b3e5107/docs/src/site/twiki/FalconDocumentation.twiki
----------------------------------------------------------------------
diff --git a/docs/src/site/twiki/FalconDocumentation.twiki b/docs/src/site/twiki/FalconDocumentation.twiki
index 267e6d1..8d2e10e 100644
--- a/docs/src/site/twiki/FalconDocumentation.twiki
+++ b/docs/src/site/twiki/FalconDocumentation.twiki
@@ -13,6 +13,7 @@
    * <a href="#Alerting_and_Monitoring">Alerting and Monitoring</a>
    * <a href="#Falcon_EL_Expressions">Falcon EL Expressions</a>
    * <a href="#Lineage">Lineage</a>
+   * <a href="#Security">Security</a>
 
 ---++ Architecture
 ---+++ Introduction
@@ -743,3 +744,7 @@ config value: org.apache.falcon.metadata.MetadataMappingService
 
 Lineage is only captured for Process executions. A future release will capture lineage for
 lifecycle policies such as replication and retention.
+
+---++ Security
+
+Security is detailed in [[Security][Security]].

http://git-wip-us.apache.org/repos/asf/incubator-falcon/blob/7b3e5107/docs/src/site/twiki/Security.twiki
----------------------------------------------------------------------
diff --git a/docs/src/site/twiki/Security.twiki b/docs/src/site/twiki/Security.twiki
index c1f7656..4dc0f4d 100644
--- a/docs/src/site/twiki/Security.twiki
+++ b/docs/src/site/twiki/Security.twiki
@@ -2,6 +2,12 @@
 
 ---++ Overview
 
+Apache Falcon enforces authentication and authorization which are detailed below. Falcon also
+provides transport level security ensuring data confidentiality and integrity.
+
+
+---++ Authentication (User Identity)
+
 Apache Falcon enforces authentication on protected resources. Once authentication has been established it sets a
 signed HTTP Cookie that contains an authentication token with the user name, user principal,
 authentication type and expiration time.
@@ -12,20 +18,127 @@ for HTTP. Hadoop Auth also supports additional authentication mechanisms on the
 simple interfaces.
 
 
----++ Authentication Methods
+---+++ Authentication Methods
 
 It supports 2 authentication methods, simple and kerberos out of the box.
 
----+++ Pseudo/Simple Authentication
+---++++ Pseudo/Simple Authentication
 
 Falcon authenticates the user by simply trusting the value of the query string parameter 'user.name'. This is the
 default mode Falcon is configured with.
 
----+++ Kerberos Authentication
+---++++ Kerberos Authentication
 
 Falcon uses HTTP Kerberos SPNEGO to authenticate the user.
 
----++ Server Side Configuration Setup
+
+---++ Authorization
+
+Falcon also enforces authorization on Entities using ACLs (Access Control Lists). ACLs are useful
+for implementing permission requirements and provide a way to set different permissions for
+specific users or named groups.
+
+By default, support for authorization is disabled and can be enabled in startup.properties.
+
+---+++ ACLs in Entity
+
+All Entities now have an ACL, which must be present if authorization is enabled. Only users who
+own or created an entity are allowed to update or delete it.
+
+An entity has an ACL (Access Control List) that is useful for implementing permission requirements
+and provides a way to set different permissions for specific users or named groups.
+<verbatim>
+    <ACL owner="test-user" group="test-group" permission="*"/>
+</verbatim>
+ACL indicates the access control list for this entity.
+owner is the owner of this entity.
+group is the group that has read access.
+permission indicates the permission; rwx is not enforced at this time.
+
+---+++ Group Memberships
+
+Once a user has been authenticated and a username has been determined, the list of groups is
+determined by a group mapping service, configured by the hadoop.security.group.mapping property
+in Hadoop. The default implementation, org.apache.hadoop.security.ShellBasedUnixGroupsMapping,
+will shell out to the Unix bash -c groups command to resolve a list of groups for a user.
+
+Note that Falcon stores the user and group of an Entity as strings; there is no
+conversion from user and group identity numbers as is conventional in Unix.
+
+---+++ Authorization Provider
+
+Falcon provides a pluggable provider interface for Authorization. It also ships with a default
+implementation that enforces the following authorization policy.
+
+---++++ Entity and Instance Management Operations Policy
+
+   * All Entity and Instance operations are authorized for users who created them, Owners and users
+     with group memberships
+   * Reference to entities within a feed or process is allowed without enforcing permissions
+      * Any Feed or Process can refer to a Cluster entity not owned by the Feed or Process owner
+      * Any Process can refer to a Feed entity not owned by the Process owner
+
+The authorization is enforced in the following way:
+
+if admin resource,
+     if authenticated user name matches the admin users configuration
+     Else if groups of the authenticated user matches the admin groups configuration
+     Else authorization exception is thrown
+Else if entities or instance resource
+     if the authenticated user matches the owner in ACL for the entity
+     Else if the groups of the authenticated user matches the group in ACL for the entity
+     Else authorization exception is thrown
+Else if lineage resource
+     All have read-only permissions, reason being folks should be able to examine the dependency
+     and allow reuse
+
+
+*Operations on Entity Resource*
+
+| *Resource*                                                                          | *Description*                      | *Authorization* |
+| [[restapi/EntityValidate][api/entities/validate/:entity-type]]                      | Validate the entity                | Owner/Group     |
+| [[restapi/EntitySubmit][api/entities/submit/:entity-type]]                          | Submit the entity                  | Owner/Group     |
+| [[restapi/EntityUpdate][api/entities/update/:entity-type/:entity-name]]             | Update the entity                  | Owner/Group     |
+| [[restapi/EntitySubmitAndSchedule][api/entities/submitAndSchedule/:entity-type]]    | Submit & Schedule the entity       | Owner/Group     |
+| [[restapi/EntitySchedule][api/entities/schedule/:entity-type/:entity-name]]         | Schedule the entity                | Owner/Group     |
+| [[restapi/EntitySuspend][api/entities/suspend/:entity-type/:entity-name]]           | Suspend the entity                 | Owner/Group     |
+| [[restapi/EntityResume][api/entities/resume/:entity-type/:entity-name]]             | Resume the entity                  | Owner/Group     |
+| [[restapi/EntityDelete][api/entities/delete/:entity-type/:entity-name]]             | Delete the entity                  | Owner/Group     |
+| [[restapi/EntityStatus][api/entities/status/:entity-type/:entity-name]]             | Get the status of the entity       | Owner/Group     |
+| [[restapi/EntityDefinition][api/entities/definition/:entity-type/:entity-name]]     | Get the definition of the entity   | Owner/Group     |
+| [[restapi/EntityList][api/entities/list/:entity-type?fields=:fields]]               | Get the list of entities           | Owner/Group     |
+| [[restapi/EntityDependencies][api/entities/dependencies/:entity-type/:entity-name]] | Get the dependencies of the entity | Owner/Group     |
+
+*REST Call on Feed and Process Instances*
+
+| *Resource*                                                                  | *Description*                | *Authorization* |
+| [[restapi/InstanceRunning][api/instance/running/:entity-type/:entity-name]] | List of running instances.   | Owner/Group     |
+| [[restapi/InstanceStatus][api/instance/status/:entity-type/:entity-name]]   | Status of a given instance   | Owner/Group     |
+| [[restapi/InstanceKill][api/instance/kill/:entity-type/:entity-name]]       | Kill a given instance        | Owner/Group     |
+| [[restapi/InstanceSuspend][api/instance/suspend/:entity-type/:entity-name]] | Suspend a running instance   | Owner/Group     |
+| [[restapi/InstanceResume][api/instance/resume/:entity-type/:entity-name]]   | Resume a given instance      | Owner/Group     |
+| [[restapi/InstanceRerun][api/instance/rerun/:entity-type/:entity-name]]     | Rerun a given instance       | Owner/Group     |
+| [[restapi/InstanceLogs][api/instance/logs/:entity-type/:entity-name]]       | Get logs of a given instance | Owner/Group     |
+
+---++++ Admin Resources Policy
+
+Only users belonging to admin users or groups have access to this resource. Admin membership is
+determined by a static configuration parameter.
+
+| *Resource*                                             | *Description*                               | *Authorization*  |
+| [[restapi/AdminStack][api/admin/stack]]                | Get stack of the server                     | Admin User/Group |
+| [[restapi/AdminVersion][api/admin/version]]            | Get version of the server                   | Admin User/Group |
+| [[restapi/AdminConfig][api/admin/config/:config-type]] | Get configuration information of the server | Admin User/Group |
+
+
+---++++ Lineage Resource Policy
+
+Lineage is read-only and hence all users can look at lineage for their respective entities.
+
+
+---++ Authentication Configuration
+
+Following is the Server Side Configuration Setup for Authentication.
 
 ---+++ Common Configuration Parameters
 
@@ -105,6 +218,51 @@ Falcon uses HTTP Kerberos SPNEGO to authenticate the user.
 *.falcon.http.authentication.blacklisted.users=
 </verbatim>
 
+---++ Authorization Configuration
+
+---+++ Enabling Authorization
+By default, support for authorization is disabled and specifying ACLs in entities is optional.
+To enable support for authorization, set falcon.security.authorization.enabled to true in the
+startup configuration.
+
+<verbatim>
+# Authorization Enabled flag: false|true
+*.falcon.security.authorization.enabled=true
+</verbatim>
+
+
+---+++ Authorization Provider
+
+Falcon bundles a basic Authorization implementation, org.apache.falcon.security.DefaultAuthorizationProvider.
+This can be overridden by custom implementations in the startup configuration.
+
+<verbatim>
+# Authorization Provider Fully Qualified Class Name
+*.falcon.security.authorization.provider=org.apache.falcon.security.DefaultAuthorizationProvider
+</verbatim>
+
+---+++ Admin Membership
+
+Administrative users are determined by the configuration:
+
+<verbatim>
+# Admin Users, comma separated users
+*.falcon.security.authorization.admin.users=falcon,ambari-qa,seetharam
+</verbatim>
+
+Administrative groups are determined by the configuration:
+
+<verbatim>
+# Admin Group Membership, comma separated groups
+*.falcon.security.authorization.admin.groups=falcon,testgroup,staff
+</verbatim>
+
+
+---++ SSL
+
+Falcon provides transport level security ensuring data confidentiality and integrity. This is
+enabled by default for communicating over HTTP between the client and the server.
+
 ---+++ SSL Configuration
 
 <verbatim>

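Reviewer note on the authorization policy described in the Security.twiki changes above: the decision flow given there in pseudocode can be sketched as follows. This is an illustrative Python model under stated assumptions, not the logic of the bundled provider. The authorize and parse_csv functions, the resource names passed as strings, and the rejection of unknown resource types are all hypothetical. The admin user and group values are the sample values from the configuration snippets in this diff.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Set


class AuthorizationError(Exception):
    """Raised when the authenticated user may not access the resource."""


@dataclass(frozen=True)
class Acl:
    owner: str
    group: str


def parse_csv(value: str) -> Set[str]:
    """Split a comma separated property value such as 'falcon,staff' into a set."""
    return {item.strip() for item in value.split(",") if item.strip()}


def authorize(resource: str, user: str, user_groups: Iterable[str],
              admin_users: Set[str], admin_groups: Set[str],
              acl: Optional[Acl] = None) -> None:
    """Return silently if access is allowed, otherwise raise AuthorizationError."""
    groups = set(user_groups)
    if resource == "admin":
        # Admin resources: the user or one of the user's groups must be configured as admin.
        if user in admin_users or groups & admin_groups:
            return
        raise AuthorizationError(f"{user} is not an admin user or in an admin group")
    if resource in ("entities", "instance"):
        # Entity and instance resources: match the owner or the group in the entity ACL.
        if acl is not None and (user == acl.owner or acl.group in groups):
            return
        raise AuthorizationError(f"{user} is neither owner nor group member")
    if resource == "lineage":
        # Lineage is read-only, so every authenticated user is allowed.
        return
    # Assumption: the source does not cover other resource types; this sketch rejects them.
    raise AuthorizationError(f"unknown resource type {resource!r}")


if __name__ == "__main__":
    admins = parse_csv("falcon,ambari-qa,seetharam")
    admin_groups = parse_csv("falcon,testgroup,staff")
    acl = Acl(owner="test-user", group="test-group")
    authorize("entities", "test-user", ["users"], admins, admin_groups, acl)
    print("test-user may operate on the entity")
```

Raising an exception on denial mirrors the "authorization exception is thrown" wording of the pseudocode; a boolean return would work equally well for a client-side check.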
http://git-wip-us.apache.org/repos/asf/incubator-falcon/blob/7b3e5107/docs/src/site/twiki/index.twiki
----------------------------------------------------------------------
diff --git a/docs/src/site/twiki/index.twiki b/docs/src/site/twiki/index.twiki
index e7917c5..7437280 100644
--- a/docs/src/site/twiki/index.twiki
+++ b/docs/src/site/twiki/index.twiki
@@ -33,7 +33,7 @@ describes various options for the command line utility provided by Falcon.
 
 Falcon provides OOTB [[HiveIntegration][lifecycle management for Tables in Hive (HCatalog)]]
 such as table replication for BCP and table eviction. Falcon also enforces
-[[Security][kerberos authentication]] on protected resources and enables SSL.
+[[Security][Security]] on protected resources and enables SSL.
 
 #LicenseInfo
 ---+ Licensing Information