Posted to commits@manifoldcf.apache.org by kw...@apache.org on 2010/04/20 02:56:25 UTC

svn commit: r935779 [3/3] - in /incubator/lcf/site: publish/ publish/images/ src/documentation/content/xdocs/ src/documentation/resources/images/

Modified: incubator/lcf/site/src/documentation/content/xdocs/end-user-documentation.xml
URL: http://svn.apache.org/viewvc/incubator/lcf/site/src/documentation/content/xdocs/end-user-documentation.xml?rev=935779&r1=935778&r2=935779&view=diff
==============================================================================
--- incubator/lcf/site/src/documentation/content/xdocs/end-user-documentation.xml (original)
+++ incubator/lcf/site/src/documentation/content/xdocs/end-user-documentation.xml Tue Apr 20 00:56:24 2010
@@ -716,6 +716,128 @@
                 <br/>
                 <p>In other words, the Web connection type is neither as easy to configure, nor as well-targeted in its separation of links and data, as the RSS connection type.  For that
                        reason, we strongly encourage you to consider using the RSS connection type for all applications where it might reasonably apply.</p>
+                <p>Many users of the Web connection type set up their jobs to run continuously, configuring them either to refetch documents occasionally, or never to refetch
+                       documents and instead expire them after some period of time.</p>
+                <p>A connection of the Web connection type has the following special tabs: "Email", "Robots", "Bandwidth", "Access Credentials", and "Certificates".  The "Email" tab
+                       looks like this:</p>
+                <br/><br/>
+                <figure src="images/web-configure-email.PNG" alt="Web Connection, Email tab" width="80%"/>
+                <br/><br/>
+                <p>Enter an email address.  This email address will be included in all requests made by the Web connection, so that webmasters can report any difficulties that their
+                       sites experience as the result of improper throttling, etc.</p>
+                <p>This field is mandatory.  While the Web connection type makes no effort to validate the correctness of the email
+                       field, you will probably want to remain a good web citizen and provide a valid email address.  Remember that it is very easy for a webmaster to block access to
+                       a crawler that does not seem to be behaving in a polite manner.</p>
+                <p>The "Robots" tab looks like this:</p>
+                <br/><br/>
+                <figure src="images/web-configure-robots.PNG" alt="Web Connection, Robots tab" width="80%"/>
+                <br/><br/>
+                <p>Select how the connection will interpret robots.txt.  Remember that you have an interest in crawling people's sites as politely as possible.</p>
+                <p>The "Bandwidth" tab allows you to specify a list of bandwidth rules.  Each rule has a regular expression matched against a URL's throttle bin.
+                       Throttle bins, in connections of the Web type, are simply the server name part of the URL.  Each rule allows you to select a maximum bandwidth, number of
+                       connections, and fetch rate.  You can have as many rules as you like; if a URL matches more than one rule, then the most conservative value will be used.</p>
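The "most conservative value" behavior described above can be sketched roughly as follows. This is an illustration only, not the connector's code: the rule tuple layout, the units, the `RULES` list, and the function names are all hypothetical, and whether the real connector matches the whole bin name or only part of it is an assumption here (this sketch requires a full match).

```python
import re
from urllib.parse import urlsplit

# Hypothetical rule records: (bin regex, max KB/sec, max connections, max fetches/min).
# The layout and units are illustrative, not the connector's configuration schema.
RULES = [
    (r".*", 64.0, 2, 12.0),
    (r".*\.example\.com", 32.0, 1, 6.0),
]

def throttle_bin(url):
    """Return the server-name part of the URL, which serves as the throttle bin."""
    return urlsplit(url).hostname or ""

def effective_limits(url, rules=RULES):
    """Combine all rules matching the URL's bin, keeping the lowest value of each limit."""
    bin_name = throttle_bin(url)
    # Assumption: a rule applies when its regex matches the entire bin name.
    matching = [rule[1:] for rule in rules if re.fullmatch(rule[0], bin_name)]
    if not matching:
        return None  # no rule matched, so nothing throttles this bin
    # zip(*matching) groups the same limit across rules; min() picks the strictest.
    return tuple(min(values) for values in zip(*matching))
```

With these sample rules, a URL on `www.example.com` matches both entries and so receives the lower of each pair of limits, while a URL on any other server receives only the catch-all limits.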
+                <p>This is what the "Bandwidth" tab looks like:</p>
+                <br/><br/>
+                <figure src="images/web-configure-bandwidth.PNG" alt="Web Connection, Bandwidth tab" width="80%"/>
+                <br/><br/>
+                <p>The screen shot shows the tab configured with a setting that is reasonably polite.  The default value for this tab is blank, meaning that, by default, there is no throttling
+                       whatsoever!  Please do not make the mistake of crawling other people's sites without adequate politeness parameters in place.</p>
+                <p>To add a rule, fill in the regular expression and the appropriate rule limit values, and click the "Add" button.</p>
+                <p>The "Bandwidth" tab is related to the throttles that you can set on the "Throttling" tab in the following ways:</p>
+                <br/>
+                <ul>
+                    <li>The "Bandwidth" tab sets the <b>maximum</b> values, while the "Throttling" tab sets the <b>average</b> values.</li>
+                    <li>The "Bandwidth" tab does not affect how documents are scheduled in the queue; it simply blocks documents until it is safe to go ahead, which will use up a crawler thread
+                           for the entire period that both the wait and the fetch take place.  The "Throttling" tab affects how often documents are scheduled, so it does not waste threads.</li>
+                </ul>
+                <br/>
+                <p>Because of the above, we suggest that you configure your Web connection using <b>both</b> the "Bandwidth" <b>and</b> the "Throttling" tabs.  Select maximum
+                       values on the "Bandwidth" tab, and corresponding average value estimates on the "Throttling" tab.  Remember that a document identifier with the Web connection type is the
+                       document's URL, and the bin name for that URL is the server name.  Also, please note that the "Maximum number of connections per JVM" field's default value of 10 is
+                       unlikely to be correct for connections of the Web type; you should have at least one available connection per worker thread, for best performance.  Since the
+                       default number of worker threads is 30, you should set this parameter to at least 30 for normal operation.</p>
+                <p>The Web connection type's "Access Credentials" tab describes how pages get authenticated.  There is support on this tab for both page-based authentication (e.g.
+                       basic auth or all forms of NTLM) and session-based authentication (which involves the fetch of many pages to establish a logged-in session).  The initial
+                       appearance of the "Access Credentials" tab shows both kinds of authentication:</p>
+                <br/><br/>
+                <figure src="images/web-configure-access-credentials.PNG" alt="Web Connection, Access Credentials tab" width="80%"/>
+                <br/><br/>
+                <p>Each kind of authentication has its own list of rules.</p>
+                <p>Specifying a page authentication rule requires simply knowing what URLs are protected, and what the proper
+                       authentication method and credentials are for those URLs.  Enter a regular expression describing the protected URLs, and select the proper authentication method.
+                       Fill in the credentials.  Click the "Add" button.</p>
+                <p>Specifying a correct session authentication rule usually requires some research.  A single session-authentication rule usually corresponds to a single session-protected
+                       site.  For that site, you will need to be able to describe the following for session authentication to function:</p>
+                <br/>
+                <ul>
+                    <li>The URLs of pages that are protected by this particular site session security</li>
+                    <li>How to detect when a page fetch is part of the login sequence</li>
+                    <li>How to fill in the appropriate forms within the login sequence with appropriate login information</li>
+                </ul>
+                <br/>
+                <p>The Web connection type labels pages that are part of the login sequence "login pages", and pages that are protected site content "content pages".  The Web
+                       connection type will not attempt to index login pages.  They are special pages that have but one purpose: establishing an authenticated session.</p>
+                <p>As if this were not complicated enough, your research also has to cover two very different cases: first, when you are entering the site anew, and second, when you try to fetch
+                       a content page but are no longer logged in because your session has expired.  In both cases, the session authentication rule must be able to properly log in and
+                       fetch content, because you cannot control when a page will be fetched or refetched by the Framework.</p>
+                <p>You declare a page to be a login page by identifying it both by its URL, and by what the crawler finds on the page when it fetches it.  For example, some session-protected
+                       sites may redirect you to a login screen when your session expires.  So, instead of fetching content, you would be fetching a redirection to a specific page.  You do <b>not</b>
+                       want either the redirection, or the login screen, to be considered content pages.  The correct way to handle such a setup would be to declare one kind of login page to consist
+                       of a redirection to the login screen URL, and another kind of login page to consist of the login screen URL with the appropriate form.  Furthermore, you would want to supply
+                       the correct login data for the form, and allow the form to be submitted, and so the login form's target may also need to be declared as a login page.</p>
+                <p>The kinds of content that the Web connection type can recognize as a login page are the following:</p>
+                <br/>
+                <ul>
+                    <li>A redirection to a specific URL, as described by a regular expression</li>
+                    <li>A page that has a form of a particular name on it, as described by a regular expression</li>
+                    <li>A page that has a link on it to a specific target, as described by a regular expression</li>
+                </ul>
+                <br/>
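The three kinds of login page listed above could be modeled along these lines. This is a hedged sketch: `FetchedPage`, `LoginPageRule`, their fields, and `is_login_page` are hypothetical names invented for illustration, and the use of an unanchored regex search is an assumption rather than documented connector behavior.

```python
import re
from dataclasses import dataclass, field

@dataclass
class FetchedPage:
    """Hypothetical summary of one fetch; not a connector data structure."""
    url: str
    redirect_to: str = ""                             # redirect target, if any
    form_names: list = field(default_factory=list)    # names of forms on the page
    link_targets: list = field(default_factory=list)  # link targets on the page

@dataclass
class LoginPageRule:
    url_regex: str     # which fetched URLs this description applies to
    kind: str          # "redirection", "form", or "link"
    target_regex: str  # pattern for the redirect URL, form name, or link target

def is_login_page(page, rule):
    """Return True when the page matches the rule by URL and by what was found on it."""
    if not re.search(rule.url_regex, page.url):
        return False
    if rule.kind == "redirection":
        candidates = [page.redirect_to] if page.redirect_to else []
    elif rule.kind == "form":
        candidates = page.form_names
    elif rule.kind == "link":
        candidates = page.link_targets
    else:
        raise ValueError("unknown login page kind: " + rule.kind)
    return any(re.search(rule.target_regex, item) for item in candidates)
```

A page is treated as a login page only when both conditions hold, which mirrors the statement that a login page is identified by its URL and by what the crawler finds when fetching it.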
+                <p>To add a session authentication rule, fill in a regular expression describing the site pages that are being protected, and click the "Add" button:</p>
+                <br/><br/>
+                <figure src="images/web-configure-access-credentials-session.PNG" alt="Web Connection, Access Credentials tab" width="80%"/>
+                <br/><br/>
+                <p>Note that you can now add login page descriptions to the newly-created rule.  To add a login page description, enter a URL regular expression, a type of login page, a
+                       target link or form name regular expression, and click the "Add" button.</p>
+                <p>When you add a login page of the "form" type, you can then add form fill-in information to the login page, as seen below:</p>
+                <br/><br/>
+                <figure src="images/web-configure-access-credentials-session-form.PNG" alt="Web Connection, Access Credentials tab" width="80%"/>
+                <br/><br/>
+                <p>Supply a regular expression for the name of the form element you want to set, and also provide a value.  If you want the value to not be visible in clear text, fill in the
+                       "password" column instead of the "value" column.  You can usually figure out the name of the form and its elements by viewing the source of the HTML page in a
+                       browser.  When you are done, click the "Add" button.</p>
+                <p>Form data that is not specified will be posted with the default value determined by the HTML of the page.  The Web connection type is unable, at this time, to execute
+                       JavaScript, and therefore you may need to fill out some form values that are filled in by JavaScript in order to get the form to post in a useful way.  If you have a form
+                       that relies heavily on JavaScript to post properly, you may need considerable effort and web programming skills to figure out how to get these forms to post properly
+                       with the Web connection type.  Luckily, such obfuscated login screens are still rare.</p>
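The way unspecified form data falls back to the page's defaults might look like the sketch below. The function name, the dictionary and tuple shapes, and the "first matching pattern wins" policy are assumptions made for illustration only.

```python
import re

def build_post_data(html_defaults, overrides):
    """Merge a form's default values with values supplied by the rule.

    html_defaults: dict mapping element name to the default value found in the HTML.
    overrides: list of (name_regex, value) pairs, as a hypothetical rule would hold them.
    """
    data = dict(html_defaults)  # copy, so the parsed page defaults stay untouched
    for name in data:
        for name_regex, value in overrides:
            if re.search(name_regex, name):
                data[name] = value
                break  # assumption: the first matching pattern wins
    return data
```

Elements that no pattern matches, such as a hidden token field, keep the value the page supplied, which is why only the fields a user would type into normally need fill-in entries.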
+                <p>A series of login pages forms a "login page sequence" for the site.  For each login page, the Web connection decides what page to fetch next based on what you specified for
+                       the login page criteria.  So, for a redirection to a specific URL, the next page to be fetched will be that redirected URL.  For a form, the next page fetched will be the
+                       action page indicated by the specified form.  For a link to a target, the next page fetched will be the target URL.  When the login page sequence ends, the next page
+                       fetched after that will be the original content page that the Web connection was trying to fetch when the login sequence started.</p>
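The traversal just described can be sketched as a small loop. Everything named here is hypothetical: the step dictionaries, `next_url`, `run_login_sequence`, and the `max_steps` guard are illustrative, and the sketch assumes the sequence ends as soon as a fetched page matches no login page description.

```python
def next_url(step):
    """Pick the next URL to fetch from a matched login page description."""
    kind = step["kind"]
    if kind == "redirection":
        return step["redirect_to"]   # follow the redirect
    if kind == "form":
        return step["form_action"]   # post to the form's action page
    if kind == "link":
        return step["link_target"]   # follow the matched link
    raise ValueError("unknown login page kind: " + kind)

def run_login_sequence(original_url, classify, max_steps=10):
    """Return the list of URLs fetched, in order.

    classify(url) returns a step dict when the fetched page is a login page,
    or None when it is not (hypothetical callback standing in for real fetches).
    """
    fetched = [original_url]
    current = original_url
    for _ in range(max_steps):  # guard against a misconfigured, endless sequence
        step = classify(current)
        if step is None:
            break
        current = next_url(step)
        fetched.append(current)
    if len(fetched) > 1:
        fetched.append(original_url)  # sequence over: retry the original content page
    return fetched
```

For a site that redirects an expired session to a login form, the resulting order is the content page, the login screen, the form's action page, and then the content page again.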
+                <p>Debugging session authentication problems is best done by looking at a Simple History report for your Web connection.  The Web connection type records several
+                       types of events which, between them, can give a very strong picture of what is happening.  These event types are as follows:</p>
+                <br/>
+                <table>
+                    <tr><td><b>Event type</b></td><td><b>Meaning</b></td></tr>
+                    <tr><td>Fetch</td><td>This event records the fetch of a URL.  The HTTP response is recorded as the response code.  In addition, there are several negative
+                        code values which the connector generates when the HTTP operation cannot be done or does not complete.</td></tr>
+                    <tr><td>Begin login</td><td>This event occurs when the connection detects the transition to a login page sequence.  When a login sequence is entered, no other
+                        pages from that protected site will be fetched until the login sequence is completed.</td></tr>
+                    <tr><td>End login</td><td>This event occurs when the connection detects the transition from a login page sequence back to normal content fetching.  When this
+                        occurs, simultaneous fetching of pages from the site is re-enabled.</td></tr>
+                </table>
+                <br/>
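When reading a history excerpt for a session-protected site, one rough way to spot trouble is to count login sequences that began but never ended and to collect fetches with error codes. The sketch below assumes the events have been copied into `(event_type, url, result_code)` tuples in time order; that tuple shape and the function name are hypothetical, and only the three event type names come from the table above.

```python
def summarize_history(events):
    """Summarize login activity from (event_type, url, result_code) tuples in time order."""
    open_logins = 0
    completed_logins = 0
    failed_fetches = []
    for event_type, url, code in events:
        if event_type == "Begin login":
            open_logins += 1
        elif event_type == "End login":
            # Clamp at zero in case the excerpt starts in the middle of a sequence.
            open_logins = max(open_logins - 1, 0)
            completed_logins += 1
        elif event_type == "Fetch" and code not in range(0, 400):
            # Negative codes mean the HTTP operation did not complete;
            # 400 and above are HTTP error responses.
            failed_fetches.append((url, code))
    return {
        "unfinished_logins": open_logins,
        "completed_logins": completed_logins,
        "failed_fetches": failed_fetches,
    }
```

A nonzero count of unfinished logins alongside a failed fetch of the form's action page would suggest that the form fill-in values, rather than the login page patterns, need attention.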
+                <p>The "Certificates" tab is used in conjunction with SSL, and permits you to define independent trust certificate stores for URLs matching specified regular expressions.
+                       You can also allow the connection to trust all certificates it sees, if you so choose.  The "Certificates" tab looks like this:</p>
+                <br/><br/>
+                <figure src="images/web-configure-certificates.PNG" alt="Web Connection, Certificates tab" width="80%"/>
+                <br/><br/>
+                <p>Type in a URL regular expression, and either check the "Trust everything" box, or browse for the appropriate certificate authority certificate that you wish to trust.  (It will
+                       also work to simply trust a server's certificate, but that certificate may change from time to time, as it expires.)  Click "Add" to add the certificate rule to the list.</p>
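The effect of a certificate rule can be pictured with Python's standard `ssl` module, purely as an analogy to what the tab configures. The `CERT_RULES` list, the file path, the host names, and the choice to return `None` when no rule matches are all assumptions of this sketch, not a description of the connector.

```python
import re
import ssl

# Hypothetical rules: (URL regex, trust-everything flag, CA certificate file or None).
CERT_RULES = [
    (r"^https://intranet\.example\.com/", False, "/path/to/internal-ca.pem"),
    (r"^https://test\.example\.com/", True, None),
]

def ssl_context_for(url, rules=CERT_RULES):
    """Build an SSLContext from the first rule whose regex matches the URL."""
    for url_regex, trust_everything, ca_file in rules:
        if not re.search(url_regex, url):
            continue
        if trust_everything:
            context = ssl.create_default_context()
            context.check_hostname = False       # must be off before CERT_NONE is set
            context.verify_mode = ssl.CERT_NONE  # accept any certificate
            return context
        # Trust only the given certificate authority for matching URLs.
        context = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
        context.load_verify_locations(cafile=ca_file)
        return context
    return None  # no rule matched; what happens then is left open in this sketch
```

Trusting a certificate authority rather than a single server certificate keeps the rule valid when the server's own certificate is renewed, which is the trade-off noted in the paragraph above.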
+                <p>When you are done, and you click the "Save" button, you will see a summary page looking something like this:</p>
+                <br/><br/>
+                <figure src="images/web-status.PNG" alt="Web Status" width="80%"/>
+                <br/><br/>
 
                 <p>More here later</p>
 

Added: incubator/lcf/site/src/documentation/resources/images/web-configure-access-credentials-session-form.PNG
URL: http://svn.apache.org/viewvc/incubator/lcf/site/src/documentation/resources/images/web-configure-access-credentials-session-form.PNG?rev=935779&view=auto
==============================================================================
Binary file - no diff available.

Propchange: incubator/lcf/site/src/documentation/resources/images/web-configure-access-credentials-session-form.PNG
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Added: incubator/lcf/site/src/documentation/resources/images/web-configure-access-credentials-session.PNG
URL: http://svn.apache.org/viewvc/incubator/lcf/site/src/documentation/resources/images/web-configure-access-credentials-session.PNG?rev=935779&view=auto
==============================================================================
Binary file - no diff available.

Propchange: incubator/lcf/site/src/documentation/resources/images/web-configure-access-credentials-session.PNG
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Added: incubator/lcf/site/src/documentation/resources/images/web-configure-access-credentials.PNG
URL: http://svn.apache.org/viewvc/incubator/lcf/site/src/documentation/resources/images/web-configure-access-credentials.PNG?rev=935779&view=auto
==============================================================================
Binary file - no diff available.

Propchange: incubator/lcf/site/src/documentation/resources/images/web-configure-access-credentials.PNG
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Added: incubator/lcf/site/src/documentation/resources/images/web-configure-bandwidth.PNG
URL: http://svn.apache.org/viewvc/incubator/lcf/site/src/documentation/resources/images/web-configure-bandwidth.PNG?rev=935779&view=auto
==============================================================================
Binary file - no diff available.

Propchange: incubator/lcf/site/src/documentation/resources/images/web-configure-bandwidth.PNG
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Added: incubator/lcf/site/src/documentation/resources/images/web-configure-certificates.PNG
URL: http://svn.apache.org/viewvc/incubator/lcf/site/src/documentation/resources/images/web-configure-certificates.PNG?rev=935779&view=auto
==============================================================================
Binary file - no diff available.

Propchange: incubator/lcf/site/src/documentation/resources/images/web-configure-certificates.PNG
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Added: incubator/lcf/site/src/documentation/resources/images/web-configure-email.PNG
URL: http://svn.apache.org/viewvc/incubator/lcf/site/src/documentation/resources/images/web-configure-email.PNG?rev=935779&view=auto
==============================================================================
Binary file - no diff available.

Propchange: incubator/lcf/site/src/documentation/resources/images/web-configure-email.PNG
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Added: incubator/lcf/site/src/documentation/resources/images/web-configure-robots.PNG
URL: http://svn.apache.org/viewvc/incubator/lcf/site/src/documentation/resources/images/web-configure-robots.PNG?rev=935779&view=auto
==============================================================================
Binary file - no diff available.

Propchange: incubator/lcf/site/src/documentation/resources/images/web-configure-robots.PNG
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Added: incubator/lcf/site/src/documentation/resources/images/web-status.PNG
URL: http://svn.apache.org/viewvc/incubator/lcf/site/src/documentation/resources/images/web-status.PNG?rev=935779&view=auto
==============================================================================
Binary file - no diff available.

Propchange: incubator/lcf/site/src/documentation/resources/images/web-status.PNG
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream