Posted to commits@manifoldcf.apache.org by kw...@apache.org on 2013/03/31 14:45:29 UTC

svn commit: r1462937 - in /manifoldcf/trunk: ./ site/src/documentation/content/xdocs/en_US/ site/src/documentation/resources/images/en_US/

Author: kwright
Date: Sun Mar 31 12:45:28 2013
New Revision: 1462937

URL: http://svn.apache.org/r1462937
Log:
Fix for CONNECTORS-670.  Update end-user documentation for RSS connector.

Added:
    manifoldcf/trunk/site/src/documentation/resources/images/en_US/rss-job-exclusions.PNG   (with props)
Modified:
    manifoldcf/trunk/CHANGES.txt
    manifoldcf/trunk/site/src/documentation/content/xdocs/en_US/end-user-documentation.xml
    manifoldcf/trunk/site/src/documentation/resources/images/en_US/rss-configure-bandwidth.PNG
    manifoldcf/trunk/site/src/documentation/resources/images/en_US/rss-configure-email.PNG
    manifoldcf/trunk/site/src/documentation/resources/images/en_US/rss-configure-proxy.PNG
    manifoldcf/trunk/site/src/documentation/resources/images/en_US/rss-configure-robots.PNG
    manifoldcf/trunk/site/src/documentation/resources/images/en_US/rss-job-canonicalization.PNG
    manifoldcf/trunk/site/src/documentation/resources/images/en_US/rss-job-dechromed-content.PNG
    manifoldcf/trunk/site/src/documentation/resources/images/en_US/rss-job-mappings.PNG
    manifoldcf/trunk/site/src/documentation/resources/images/en_US/rss-job-metadata.PNG
    manifoldcf/trunk/site/src/documentation/resources/images/en_US/rss-job-security.PNG
    manifoldcf/trunk/site/src/documentation/resources/images/en_US/rss-job-time-values.PNG
    manifoldcf/trunk/site/src/documentation/resources/images/en_US/rss-job-urls.PNG
    manifoldcf/trunk/site/src/documentation/resources/images/en_US/rss-status.PNG

Modified: manifoldcf/trunk/CHANGES.txt
URL: http://svn.apache.org/viewvc/manifoldcf/trunk/CHANGES.txt?rev=1462937&r1=1462936&r2=1462937&view=diff
==============================================================================
--- manifoldcf/trunk/CHANGES.txt (original)
+++ manifoldcf/trunk/CHANGES.txt Sun Mar 31 12:45:28 2013
@@ -3,6 +3,9 @@ $Id$
 
 ======================= 1.2-dev =====================
 
+CONNECTORS-670: Update end-user documentation for RSS connector.
+(Karl Wright)
+
 CONNECTORS-642: Add Elastic Search plugin.
 (Simon Willnauer, Piergiorgio Lucidi, Karl Wright)
 

Modified: manifoldcf/trunk/site/src/documentation/content/xdocs/en_US/end-user-documentation.xml
URL: http://svn.apache.org/viewvc/manifoldcf/trunk/site/src/documentation/content/xdocs/en_US/end-user-documentation.xml?rev=1462937&r1=1462936&r2=1462937&view=diff
==============================================================================
--- manifoldcf/trunk/site/src/documentation/content/xdocs/en_US/end-user-documentation.xml (original)
+++ manifoldcf/trunk/site/src/documentation/content/xdocs/en_US/end-user-documentation.xml Sun Mar 31 12:45:28 2013
@@ -912,25 +912,25 @@
                 <br/><br/>
                 <figure src="images/en_US/rss-configure-bandwidth.PNG" alt="RSS Connection, Bandwidth tab" width="80%"/>
                 <br/><br/>
-                <p>This tab allows you to control the <b>maximum</b> rate at which the connection fetches data, on a per-server basis, as well as the <b>maximum</b> fetches per minute,
-                       also per-server.  Finally, the maximum number of socket connections made per server at any one time is also controllable by this tab.</p>
-                <p>The screen shot displays parameters that are
-                       considered reasonably polite.  The default values for this table are all blank, meaning that, by default, there is no throttling whatsoever!  Please do not make the mistake
-                       of crawling other people's sites without adequate politeness parameters in place.</p>
-                <p>The "Throttle group" parameter allows you to treat multiple RSS-type connections together, for the purposes of throttling.  All RSS-type connections that have the same
-                       throttle group name will use the same pool for throttling purposes.</p>
+                <p>This tab allows you to control the <b>maximum</b> rate at which the connection fetches data, on a per-server basis, as well as the <b>maximum</b> fetches
+                       per minute, also per-server.  Finally, the maximum number of socket connections made per server at any one time is also controllable by this tab.</p>
+                <p>The screen shot displays parameters that are considered reasonably polite.  The default values for this table are all blank, meaning that, by default, there is no
+                       throttling whatsoever!  Please do not make the mistake of crawling other people's sites without adequate politeness parameters in place.</p>
+                <p>The "Throttle group" parameter allows you to treat multiple RSS-type connections together, for the purposes of throttling.  All RSS-type connections that
+                       have the same throttle group name will use the same pool for throttling purposes.</p>
                 <p>The "Bandwidth" tab is related to the throttles that you can set on the "Throttling" tab in the following ways:</p>
                 <br/>
                 <ul>
                     <li>The "Bandwidth" tab sets the <b>maximum</b> values, while the "Throttling" tab sets the <b>average</b> values.</li>
-                    <li>The "Bandwidth" tab does not affect how documents are scheduled in the queue; it simply blocks documents until it is safe to go ahead, which will use up a crawler thread
-                           for the entire period that both the wait and the fetch take place.  The "Throttling" tab affects how often documents are scheduled, so it does not waste threads.</li>
+                    <li>The "Bandwidth" tab does not affect how documents are scheduled in the queue; it simply blocks documents until it is safe to go ahead, which will use up a
+                          crawler thread for the entire period that both the wait and the fetch take place.  The "Throttling" tab affects how often documents are scheduled, so it does
+                          not waste threads.</li>
                 </ul>
                 <br/>
                 <p>Because of the above, we suggest that you configure your RSS connection using <b>both</b> the "Bandwidth" <b>and</b> the "Throttling" tabs.  Select maximum
                        values on the "Bandwidth" tab, and corresponding average values estimates on the "Throttling" tab.  Remember that a document identifier for an RSS connection is the
-                       document's URL, and the bin name for that URL is the server name.  Also, please note that the "Maximum number of connections per JVM" field's default value of 10 is
-                       unlikely to be correct for connections of the RSS type; you should have at least one available connection per worker thread, for best performance.  Since the
+                       document's URL, and the bin name for that URL is the server name.  Also, please note that the "Maximum number of connections per JVM" field's default value
+                       of 10 is unlikely to be correct for connections of the RSS type; you should have at least one available connection per worker thread, for best performance.  Since the
                        default number of worker threads is 30, you should set this parameter to at least a value of 30 for normal operation.</p>
                 <p>The "Proxy" tab allows you to specify a proxy that you want to crawl through.  The RSS connection type supports proxies that are secured with all forms of the NTLM
                        authentication method.  This is quite typical of large organizations.  The tab looks like this:</p>
@@ -944,8 +944,8 @@
                 <figure src="images/en_US/rss-status.PNG" alt="RSS Status" width="80%"/>
                 <br/><br/>
                 <p></p>
-                <p>Jobs created using connections of the RSS type have the following additional tabs: "URLs", "Canonicalization", "Mappings", "Time Values", "Security", "Metadata", and
-                       "Dechromed Content".  The URLs tab is where you describe the feeds that are part of the job.  It looks like this:</p>
+                <p>Jobs created using connections of the RSS type have the following additional tabs: "URLs", "Canonicalization", "URL mappings", "Exclusions", "Time Values",
+                       "Security", "Metadata", and "Dechromed Content".  The URLs tab is where you describe the feeds that are part of the job.  It looks like this:</p>
                 <br/><br/>
                 <figure src="images/en_US/rss-job-urls.PNG" alt="RSS job, URLs tab" width="80%"/>
                 <br/><br/>
@@ -976,6 +976,12 @@
                        <code>http://Server/Folder_1/Filename</code>, it would output the string <code>http://Folder_1/Filename</code>.</p>
                 <p>If more than one rule is present, the rules are all executed in sequence.  That is, the output of the first rule is modified by the second rule, etc.</p>
                 <p>To add a rule, fill in the match expression and output string, and click the "Add" button.</p>
+                <p>The "Exclusions" tab looks like this:</p>
+                <br/><br/>
+                <figure src="images/en_US/rss-job-exclusions.PNG" alt="RSS job, Exclusions tab" width="80%"/>
+                <br/><br/>
+                <p>Here you can enter a set of regular expressions, one per line, which describe which document URLs to exclude from the job.  This can be very helpful if you
+                     are crawling RSS feeds that include a variety of content where you only want to index a subset of the content.</p>
                 <p>The "Time Values" tab looks like this:</p>
                 <br/><br/>
                 <figure src="images/en_US/rss-job-time-values.PNG" alt="RSS job, Time Values tab" width="80%"/>
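
The "Exclusions" paragraph added in the hunk above describes a set of regular expressions, one per line, that select document URLs to leave out of a job. The sketch below is not taken from the ManifoldCF source; it is a minimal illustration of that idea under stated assumptions. The class name ExclusionSketch, the method names, and the example URLs are all hypothetical, and whether the connector matches a pattern anywhere in the URL or requires a full match is not stated in this commit, so the "match anywhere" behavior here is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class ExclusionSketch {
    // Compile one pattern per non-blank line, mirroring the "one per line" entry format.
    // Trimming and skipping blank lines are assumptions made for this sketch.
    static List<Pattern> parseExclusions(String text) {
        List<Pattern> patterns = new ArrayList<>();
        for (String line : text.split("\\r?\\n")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                patterns.add(Pattern.compile(trimmed));
            }
        }
        return patterns;
    }

    // ASSUMPTION: a URL is excluded when any pattern matches somewhere within it.
    static boolean isExcluded(String url, List<Pattern> patterns) {
        for (Pattern p : patterns) {
            if (p.matcher(url).find()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Two hypothetical exclusion lines: anything under /sports/, and any PDF.
        List<Pattern> exclusions = parseExclusions("/sports/\n\\.pdf$\n");
        System.out.println(isExcluded("http://news.example.com/sports/story1.html", exclusions));
        System.out.println(isExcluded("http://news.example.com/world/story2.html", exclusions));
    }
}
```

If the connector instead requires each expression to match the whole URL, the call to find() would be replaced by matches(), and patterns such as /sports/ would need leading and trailing wildcards.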

Modified: manifoldcf/trunk/site/src/documentation/resources/images/en_US/rss-configure-bandwidth.PNG
URL: http://svn.apache.org/viewvc/manifoldcf/trunk/site/src/documentation/resources/images/en_US/rss-configure-bandwidth.PNG?rev=1462937&r1=1462936&r2=1462937&view=diff
==============================================================================
Binary files - no diff available.

Modified: manifoldcf/trunk/site/src/documentation/resources/images/en_US/rss-configure-email.PNG
URL: http://svn.apache.org/viewvc/manifoldcf/trunk/site/src/documentation/resources/images/en_US/rss-configure-email.PNG?rev=1462937&r1=1462936&r2=1462937&view=diff
==============================================================================
Binary files - no diff available.

Modified: manifoldcf/trunk/site/src/documentation/resources/images/en_US/rss-configure-proxy.PNG
URL: http://svn.apache.org/viewvc/manifoldcf/trunk/site/src/documentation/resources/images/en_US/rss-configure-proxy.PNG?rev=1462937&r1=1462936&r2=1462937&view=diff
==============================================================================
Binary files - no diff available.

Modified: manifoldcf/trunk/site/src/documentation/resources/images/en_US/rss-configure-robots.PNG
URL: http://svn.apache.org/viewvc/manifoldcf/trunk/site/src/documentation/resources/images/en_US/rss-configure-robots.PNG?rev=1462937&r1=1462936&r2=1462937&view=diff
==============================================================================
Binary files - no diff available.

Modified: manifoldcf/trunk/site/src/documentation/resources/images/en_US/rss-job-canonicalization.PNG
URL: http://svn.apache.org/viewvc/manifoldcf/trunk/site/src/documentation/resources/images/en_US/rss-job-canonicalization.PNG?rev=1462937&r1=1462936&r2=1462937&view=diff
==============================================================================
Binary files - no diff available.

Modified: manifoldcf/trunk/site/src/documentation/resources/images/en_US/rss-job-dechromed-content.PNG
URL: http://svn.apache.org/viewvc/manifoldcf/trunk/site/src/documentation/resources/images/en_US/rss-job-dechromed-content.PNG?rev=1462937&r1=1462936&r2=1462937&view=diff
==============================================================================
Binary files - no diff available.

Added: manifoldcf/trunk/site/src/documentation/resources/images/en_US/rss-job-exclusions.PNG
URL: http://svn.apache.org/viewvc/manifoldcf/trunk/site/src/documentation/resources/images/en_US/rss-job-exclusions.PNG?rev=1462937&view=auto
==============================================================================
Binary file - no diff available.

Propchange: manifoldcf/trunk/site/src/documentation/resources/images/en_US/rss-job-exclusions.PNG
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Modified: manifoldcf/trunk/site/src/documentation/resources/images/en_US/rss-job-mappings.PNG
URL: http://svn.apache.org/viewvc/manifoldcf/trunk/site/src/documentation/resources/images/en_US/rss-job-mappings.PNG?rev=1462937&r1=1462936&r2=1462937&view=diff
==============================================================================
Binary files - no diff available.

Modified: manifoldcf/trunk/site/src/documentation/resources/images/en_US/rss-job-metadata.PNG
URL: http://svn.apache.org/viewvc/manifoldcf/trunk/site/src/documentation/resources/images/en_US/rss-job-metadata.PNG?rev=1462937&r1=1462936&r2=1462937&view=diff
==============================================================================
Binary files - no diff available.

Modified: manifoldcf/trunk/site/src/documentation/resources/images/en_US/rss-job-security.PNG
URL: http://svn.apache.org/viewvc/manifoldcf/trunk/site/src/documentation/resources/images/en_US/rss-job-security.PNG?rev=1462937&r1=1462936&r2=1462937&view=diff
==============================================================================
Binary files - no diff available.

Modified: manifoldcf/trunk/site/src/documentation/resources/images/en_US/rss-job-time-values.PNG
URL: http://svn.apache.org/viewvc/manifoldcf/trunk/site/src/documentation/resources/images/en_US/rss-job-time-values.PNG?rev=1462937&r1=1462936&r2=1462937&view=diff
==============================================================================
Binary files - no diff available.

Modified: manifoldcf/trunk/site/src/documentation/resources/images/en_US/rss-job-urls.PNG
URL: http://svn.apache.org/viewvc/manifoldcf/trunk/site/src/documentation/resources/images/en_US/rss-job-urls.PNG?rev=1462937&r1=1462936&r2=1462937&view=diff
==============================================================================
Binary files - no diff available.

Modified: manifoldcf/trunk/site/src/documentation/resources/images/en_US/rss-status.PNG
URL: http://svn.apache.org/viewvc/manifoldcf/trunk/site/src/documentation/resources/images/en_US/rss-status.PNG?rev=1462937&r1=1462936&r2=1462937&view=diff
==============================================================================
Binary files - no diff available.