Posted to commits@manifoldcf.apache.org by kw...@apache.org on 2013/07/01 06:46:13 UTC

svn commit: r1498214 - /manifoldcf/trunk/site/src/documentation/content/xdocs/en_US/concepts.xml

Author: kwright
Date: Mon Jul  1 04:46:13 2013
New Revision: 1498214

URL: http://svn.apache.org/r1498214
Log:
Update concepts document.  Part of CONNECTORS-743.

Modified:
    manifoldcf/trunk/site/src/documentation/content/xdocs/en_US/concepts.xml

Modified: manifoldcf/trunk/site/src/documentation/content/xdocs/en_US/concepts.xml
URL: http://svn.apache.org/viewvc/manifoldcf/trunk/site/src/documentation/content/xdocs/en_US/concepts.xml?rev=1498214&r1=1498213&r2=1498214&view=diff
==============================================================================
--- manifoldcf/trunk/site/src/documentation/content/xdocs/en_US/concepts.xml (original)
+++ manifoldcf/trunk/site/src/documentation/content/xdocs/en_US/concepts.xml Mon Jul  1 04:46:13 2013
@@ -53,15 +53,27 @@
       <section>
         <title>ManifoldCF security model</title>
         <p></p>
-        <p>The ManifoldCF security model is based loosely on the standard authorization concepts and hierarchies found in Microsoft's Active Directory.  Active Directory is quite common in the kinds of environments where data repositories exist that are ripe for indexing.  Active Directory's authorization model is also easily used in a general way to represent authorization for a huge variety of third-party content repositories.</p>
-        <p></p>
-        <p>ManifoldCF defines a concept of an <em>access token</em>.  An access token, to ManifoldCF, is a string which is meaningful only to a specific connector or connectors.  This string describes the ability of a user to view (or not view) some set of documents.  For documents protected by Active Directory itself, an access token would be an Active Directory SID (e.g. "S-1-23-4-1-45").  But, for example, for documents protected by Livelink a wholly different string would be used.</p>
-        <p></p>
-        <p>In the ManifoldCF security model, it is the job of an <em>authority</em> to provide a list of access tokens for a given searching user.  Multiple authorities cooperate in that each one can add to the list of access tokens describing a given user's security.  The resulting access tokens are handed to the search engine as part of every search request, so that the search engine may properly exclude documents that the user is not allowed to see.</p>
-        <p></p>
-        <p>When document indexing is done, therefore, it is the job of the crawler to hand access tokens to the search engine, so that it may categorize the documents properly according to their accessibility.  Note that the access tokens so provided are meaningful only within the space of the governing authority.  Access tokens can be provided as "grant" tokens, or as "deny" tokens.  Finally, there are multiple levels of tokens, which correspond to Active Directory's concepts of "share" security, "directory" security, or "file" security.  (The latter concepts are rarely used except for documents that come from Windows or Samba systems.)</p>
-        <p></p>
-        <p>Once all these documents and their access tokens are handed to the search engine, it is the search engine's job to enforce security by excluding inappropriate documents from the search results.  For Solr 1.5, this infrastructure has been submitted in jira ticket SOLR-1895, found <a href="https://issues.apache.org/jira/browse/SOLR-1895">here</a>, where you can download a SearchComponent plug-in and simple instructions for setting up your copy of Solr to enforce ManifoldCF's model of document security.  Bear in mind that this plug-in is still not a complete solution, as it requires an authenticated user name to be passed to it from some upstream source, possibly a JAAS authenticator within an application server framework.</p>
+        <p>The ManifoldCF security model is based loosely on the standard authorization concepts and hierarchies found in Microsoft's Active Directory.  Active Directory is quite
+          common in the kinds of environments where data repositories exist that are ripe for indexing.  Active Directory's authorization model is also easily used in a general way to 
+          represent authorization for a huge variety of third-party content repositories.</p>
+        <p></p>
+        <p>ManifoldCF defines a concept of an <em>access token</em>.  An access token, to ManifoldCF, is a string which is meaningful only to a specific connector or
+          connectors.  This string describes the ability of a user to view (or not view) some set of documents.  For documents protected by Active Directory itself, an access token
+          would be an Active Directory SID (e.g. "S-1-23-4-1-45").  But, for example, for documents protected by Livelink a wholly different string would be used.</p>
+        <p></p>
+        <p>In the ManifoldCF security model, it is the job of an <em>authority</em> to provide a list of access tokens for a given searching user.  Multiple authorities cooperate
+          in that each one can add to the list of access tokens describing a given user's security.  The resulting access tokens are handed to the search engine as part of every
+          search request, so that the search engine may properly exclude documents that the user is not allowed to see.</p>
+        <p></p>
+        <p>When document indexing is done, therefore, it is the job of the crawler to hand access tokens to the search engine, so that it may categorize the documents properly
+          according to their accessibility.  Note that the access tokens so provided are meaningful only within the space of the governing authority.  Access tokens can be provided
+          as "grant" tokens, or as "deny" tokens.  Finally, there are multiple levels of tokens, which correspond to Active Directory's concepts of "share" security, "directory" security,
+          or "file" security.  (The latter concepts are rarely used except for documents that come from Windows or Samba systems.)</p>
+        <p></p>
+        <p>Once all these documents and their access tokens are handed to the search engine, it is the search engine's job to enforce security by excluding inappropriate documents
+          from the search results.  For Solr and for ElasticSearch, this infrastructure has been included in ManifoldCF releases as a Solr plugin (both 3.x and 4.x varieties) and an
+          ElasticSearch plugin.  Bear in mind that this plug-in is still not a complete solution, as it requires an authenticated user
+          name to be passed to it from some upstream source, possibly a JAAS authenticator within an application server framework.</p>
         <p></p>
       </section>
       <section>
@@ -70,20 +82,25 @@
         <section>
           <title>Connectors</title>
           <p></p>
-          <p>ManifoldCF defines three different kinds of connectors.  These are:</p>
+          <p>ManifoldCF defines four different kinds of connectors.  These are:</p>
           <p></p>
           <ul>
+            <li>User mapping connectors</li>
             <li>Authority connectors</li>
             <li>Repository connectors</li>
             <li>Output connectors</li>
           </ul>
           <p></p>
-          <p>All connectors share certain characteristics.  First, they are pooled.  This means that ManifoldCF keeps configured and connected instances of a connector around for a while, and has the ability to limit the total number of such instances to within some upper limit.  Connector implementations have specific methods in them for managing their existence in the pools that ManifoldCF keeps them in.  Second, they are configurable.  The configuration description for a connector is an XML document, whose precise format is determined by the connector implementation.  A configured connector instance is called a <em>connection</em>, by common ManifoldCF convention.</p>
+          <p>All connectors share certain characteristics.  First, they are pooled.  This means that ManifoldCF keeps configured and connected instances of a connector around for
+            a while, and has the ability to limit the total number of such instances to within some upper limit.  Connector implementations have specific methods in them for managing
+            their existence in the pools that ManifoldCF keeps them in.  Second, they are configurable.  The configuration description for a connector is an XML document, whose precise
+            format is determined by the connector implementation.  A configured connector instance is called a <em>connection</em>, by common ManifoldCF convention.</p>
           <p></p>
           <p>The function of each type of connector is described below.</p>
           <p></p>
           <table>
             <tr><th>Connector type</th><th>Function</th></tr>
+            <tr><td>User mapping connector</td><td>Maps a user name to another (equivalent) user name, typically by means of a regular expression mechanism, or by repository access</td></tr>
             <tr><td>Authority connector</td><td>Furnishes a standard way of mapping a user name to access tokens that are meaningful for a given type of repository</td></tr>
             <tr><td>Repository connector</td><td>Fetches documents from a specific kind of repository, such as SharePoint or off the web</td></tr>
             <tr><td>Output connector</td><td>Pushes document ingestion requests and deletion requests to a specific kind of back end search engine or other entity, such as Lucene</td></tr>
@@ -93,17 +110,27 @@
         <section>
           <title>Connections</title>
           <p></p>
-          <p>As described above, a <em>connection</em> is a connector implementation plus connector-specific configuration information.  A user can define a connection of all three types in the crawler UI.</p>
-          <p></p>
-          <p>The kind of information included in the configuration data for a connector typically describes the "how", as opposed to the "what".  For example, you'd configure a LiveLink connection by specifying how to talk to the LiveLink server.  You would <strong>not</strong> include information about which documents to select in such a configuration.</p>
+          <p>As described above, a <em>connection</em> is a connector implementation plus connector-specific configuration information.  A user can define a connection of all
+            three types in the crawler UI.</p>
           <p></p>
-          <p>There is one difference between how you define a <em>repository connection</em>, vs. how you would define an <em>authority connection</em> or <em>output connection</em>.  The difference is that you must specify a governing authority connection for your repository connection.  This is because <strong>all</strong> documents ingested by ManifoldCF need to include appropriate access tokens, and those access tokens are specific to the governing authority.</p>
+          <p>The kind of information included in the configuration data for a connector typically describes the "how", as opposed to the "what".  For example, you'd configure a
+            LiveLink connection by specifying how to talk to the LiveLink server.  You would <strong>not</strong> include information about which documents to select in such a
+            configuration.</p>
+          <p></p>
+          <p>There is one difference between how you define a <em>repository connection</em>, vs. how you would define an <em>authority connection</em> or <em>output
+            connection</em> or <em>mapping connection</em>.  The difference is that you must specify a governing authority connection for your repository connection.  This is
+            because <strong>all</strong> documents ingested by ManifoldCF need to include appropriate access tokens, and those access tokens are specific to the governing authority.</p>
+          <p></p>
+          <p>Another difference in how you define an <em>authority connection</em> or <em>mapping connection</em>, vs. other connections, is that you can specify a prerequisite
+            <em>mapping connection</em> that must occur beforehand.  This means you can have multiple user mappings that occur in a defined sequence, before the authority is
+            invoked.</p>
           <p></p>
         </section>
         <section>
           <title>Jobs</title>
           <p></p>
-          <p>A <em>job</em> in ManifoldCF parlance is a description of some kind of synchronization that needs to occur between a specified repository connection and a specified output connection.  A job includes the following:</p>
+          <p>A <em>job</em> in ManifoldCF parlance is a description of some kind of synchronization that needs to occur between a specified repository connection and a specified
+            output connection.  A job includes the following:</p>
           <p></p>
           <ul>
             <li>A verbal description</li>
@@ -114,7 +141,8 @@
             <li>A schedule for when the job will run: either within specified time windows, or on demand</li>
           </ul>
           <p></p>
-          <p>Jobs are allowed to share the same repository connection, and thus they can overlap in the set of documents they describe.  ManifoldCF permits this situation, although when it occurs it is probably an accident.</p>
+          <p>Jobs are allowed to share the same repository connection, and thus they can overlap in the set of documents they describe.  ManifoldCF permits this situation, although 
+            when it occurs it is probably an accident.</p>
         </section>
       </section>
     </section>