Posted to commits@nutch.apache.org by le...@apache.org on 2014/06/09 15:35:38 UTC
svn commit: r1601376 - /nutch/cms_site/trunk/content/bot.md
Author: lewismc
Date: Mon Jun 9 13:35:38 2014
New Revision: 1601376
URL: http://svn.apache.org/r1601376
Log:
Test formatting on bot.html
Modified:
nutch/cms_site/trunk/content/bot.md
Modified: nutch/cms_site/trunk/content/bot.md
URL: http://svn.apache.org/viewvc/nutch/cms_site/trunk/content/bot.md?rev=1601376&r1=1601375&r2=1601376&view=diff
==============================================================================
--- nutch/cms_site/trunk/content/bot.md (original)
+++ nutch/cms_site/trunk/content/bot.md Mon Jun 9 13:35:38 2014
@@ -18,66 +18,62 @@ specific language governing permissions
under the License.
-->
- <!-- Subhead
+<!-- Subhead
================================================== -->
- <header class="jumbotron subhead" id="overview">
- <div class="container">
- <h1>Nutch Robot</h1>
- <p class="lead">A page for SysAdmins/WebMasters and other angry
- people... ;)</p>
- </div>
- </header>
+<header class="jumbotron subhead" id="overview">
+ <div class="container">
+ <h1>Nutch Robot</h1>
+ <p class="lead">A page for SysAdmins/WebMasters and other angry people... ;)</p>
+ </div>
+</header>
- <div class="container">
- <!-- Typography ================================================== -->
- <section id="application">
- <div class="page-header">
- <h1>Introduction</h1>
- <p>If you're reading this, chances are you've seen a Nutch-based
- robot visiting your site while looking through your server logs.
- Our software obeys robots.txt files and robot META tags in HTML.
- These are the standard mechanisms for webmasters to tell web robots
- which portions of a site a robot is welcome to access.</p>
- <h1>Sysadmins/robots.txt</h1>
- <p>
- We're a software project, not a service, so please understand that
- a misbehaving crawler appearing with our Agent string is not run by
- us. Our software may be run by anyone. However, we'd still like to
- hear about any bad behavior. If possible, please include the name
- of the domain and some representative log entries. We can be
- reached at
- <code>dev [at] nutch [dot] apache [dot] org</code>
- .
- </p>
- <p>
- Our software obeys the robots.txt exclusion standard, described at
- <a href="http://www.robotstxt.org/wc/exclusion.html#robotstxt">
- http://www.robotstxt.org/wc/exclusion.html#robotstxt</a>. Different
- installations of the Nutch software may specify different agent
- names, but all should respond to the agent name "Nutch". Thus to
- ban all Nutch-based crawlers from your site, place the following in
- your robots.txt file:
- </p>
- <pre>User-agent: Nutch<br>Disallow: /</pre>
- </div>
- <div class="page-header">
- <h1>Webmasters/Robots META</h1>
- <p>
- If you do not have permission to edit the /robots.txt file on your
- server, you can still tell robots not to index your pages or follow
- your links. The standard mechanism for this is the robots META tag,
- as described at<a href="http://www.robotstxt.org/wc/meta-user.html">
- http://www.robotstxt.org/wc/meta-user.html</a>.
- </p>
- </div>
- <div class="page-header">
- <h1>Contact us</h1>
- <p>
- If your site has problems or questions about the Nutch crawler,
- please send an email to the
- <code>agent [at] nutch [dot] apache [dot] org</code>
- - Nutch agent mailing list.
- </p>
- </div>
- </section>
- </div>
+<div class="container">
+ <!-- Typography ================================================== -->
+ <section id="application">
+ <div class="page-header">
+ <h1>Introduction</h1>
+ <p>If you're reading this, chances are you've seen a Nutch-based
+ robot visiting your site while looking through your server logs.
+ Our software obeys robots.txt files and robot META tags in HTML.
+ These are the standard mechanisms for webmasters to tell web robots
+ which portions of a site a robot is welcome to access.</p>
+ <h1>Sysadmins/robots.txt</h1>
+ <p>
+ We're a software project, not a service, so please understand that
+ a misbehaving crawler appearing with our Agent string is not run by
+ us. Our software may be run by anyone. However, we'd still like to
+ hear about any bad behavior. If possible, please include the name
+ of the domain and some representative log entries. We can be
+ reached at <code>dev [at] nutch [dot] apache [dot] org</code>.
+ </p>
+ <p>
+ Our software obeys the robots.txt exclusion standard, described at
+ <a href="http://www.robotstxt.org/wc/exclusion.html#robotstxt">
+ http://www.robotstxt.org/wc/exclusion.html#robotstxt</a>. Different
+ installations of the Nutch software may specify different agent
+ names, but all should respond to the agent name "Nutch". Thus to
+ ban all Nutch-based crawlers from your site, place the following in
+ your robots.txt file:</p>
+ <pre>User-agent: Nutch<br>Disallow: /</pre>
+ </div>
+ <div class="page-header">
+ <h1>Webmasters/Robots META</h1>
+ <p>
+ If you do not have permission to edit the /robots.txt file on your
+ server, you can still tell robots not to index your pages or follow
+ your links. The standard mechanism for this is the robots META tag,
+ as described at <a href="http://www.robotstxt.org/wc/meta-user.html">
+ http://www.robotstxt.org/wc/meta-user.html</a>.
+ </p>
+ </div>
+ <div class="page-header">
+ <h1>Contact us</h1>
+ <p>
+ If your site has problems or questions about the Nutch crawler,
+ please send an email to the
+ <code>agent [at] nutch [dot] apache [dot] org</code>
+ - Nutch agent mailing list.
+ </p>
+ </div>
+ </section>
+</div>
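The page text above describes two opt-out mechanisms: a `User-agent: Nutch` / `Disallow: /` rule in robots.txt, and the robots META tag. The sketch below checks both from the webmaster's side using only the Python standard library. It is an illustration, not Nutch's own (Java) implementation, and Python's agent matching may differ in detail from Nutch's. The agent strings `MyCrawler-Nutch/1.8` and `OtherBot/2.0`, the `example.com` URL, and the helper names are hypothetical.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser

# The rule quoted on the page: ban every crawler that answers to "Nutch".
ROBOTS_TXT = """\
User-agent: Nutch
Disallow: /
"""


def is_allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


class RobotsMetaParser(HTMLParser):
    """Collect the directives of every <meta name="robots"> tag in a page."""

    def __init__(self) -> None:
        super().__init__()
        self.directives: set[str] = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        # Attribute values keep their original case, so normalise them here.
        if (attr.get("name") or "").lower() != "robots":
            return
        content = attr.get("content") or ""
        self.directives.update(
            part.strip().lower() for part in content.split(",") if part.strip()
        )


def robots_meta_directives(html: str) -> set[str]:
    """Return the lower-cased robots META directives found in `html`."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


if __name__ == "__main__":
    url = "http://example.com/page.html"  # hypothetical URL
    print(is_allowed(ROBOTS_TXT, "MyCrawler-Nutch/1.8", url))
    print(is_allowed(ROBOTS_TXT, "OtherBot/2.0", url))
    page = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
    print(sorted(robots_meta_directives(page)))
```

Python's parser treats the robots.txt agent token as a case-insensitive substring of the crawler's agent name, which is why a customised agent string that still contains "Nutch" is blocked here. Checking a site's real robots.txt this way is a quick sanity test before reporting a misbehaving crawler.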