Posted to commits@nutch.apache.org by bu...@apache.org on 2014/06/09 15:35:50 UTC
svn commit: r911967 - in /websites/staging/nutch/trunk/content: ./ bot.html
Author: buildbot
Date: Mon Jun 9 13:35:50 2014
New Revision: 911967
Log:
Staging update by buildbot for nutch
Modified:
websites/staging/nutch/trunk/content/ (props changed)
websites/staging/nutch/trunk/content/bot.html
Propchange: websites/staging/nutch/trunk/content/
------------------------------------------------------------------------------
--- cms:source-revision (original)
+++ cms:source-revision Mon Jun 9 13:35:50 2014
@@ -1 +1 @@
-1601369
+1601376
Modified: websites/staging/nutch/trunk/content/bot.html
==============================================================================
--- websites/staging/nutch/trunk/content/bot.html (original)
+++ websites/staging/nutch/trunk/content/bot.html Mon Jun 9 13:35:50 2014
@@ -167,72 +167,66 @@ specific language governing permissions
under the License.
-->
-<div class="codehilite"><pre><span class="o"><</span>!<span class="o">--</span> <span class="n">Subhead</span>
-</pre></div>
+<!-- Subhead
+================================================== -->
-
-<p>================================================== -->
- <header class="jumbotron subhead" id="overview">
- <div class="container">
- <h1>Nutch Robot</h1>
- <p class="lead">A page for SysAdmins/WebMasters and other angry
- people... ;)</p>
- </div>
- </header></p>
-<div class="codehilite"><pre><span class="nt"><div</span> <span class="na">class=</span><span class="s">"container"</span><span class="nt">></span>
- <span class="c"><!-- Typography ================================================== --></span>
- <span class="nt"><section</span> <span class="na">id=</span><span class="s">"application"</span><span class="nt">></span>
- <span class="nt"><div</span> <span class="na">class=</span><span class="s">"page-header"</span><span class="nt">></span>
- <span class="nt"><h1></span>Introduction<span class="nt"></h1></span>
- <span class="nt"><p></span>If you're reading this, chances are you've seen a Nutch-based
- robot visiting your site while looking through your server logs.
- Our software obeys robots.txt files and robot META tags in HTML.
- These are the standard mechanisms for webmasters to tell web robots
- which portions of a site a robot is welcome to access.<span class="nt"></p></span>
- <span class="nt"><h1></span>Sysadmins/robots.txt<span class="nt"></h1></span>
- <span class="nt"><p></span>
- We're a software project, not a service, so please understand that
- a misbehaving crawler appearing with our Agent string is not run by
- us. Our software may be run by anyone. However, we'd still like to
- hear about any bad behavior. If possible, please include the name
- of the domain and some representative log entries. We can be
- reached at
- <span class="nt"><code></span>dev [at] nutch [dot] apache [dot] org<span class="nt"></code></span>
- .
- <span class="nt"></p></span>
- <span class="nt"><p></span>
- Our software obeys the robots.txt exclusion standard, described at
- <span class="nt"><a</span> <span class="na">href=</span><span class="s">"http://www.robotstxt.org/wc/exclusion.html#robotstxt"</span><span class="nt">></span>
- http://www.robotstxt.org/wc/exclusion.html#robotstxt<span class="nt"></a></span>. Different
- installations of the Nutch software may specify different agent
- names, but all should respond to the agent name "Nutch". Thus to
- ban all Nutch-based crawlers from your site, place the following in
- your robots.txt file:
- <span class="nt"></p></span>
- <span class="nt"><pre></span>User-agent: Nutch<span class="nt"><br></span>Disallow: /<span class="nt"></pre></span>
- <span class="nt"></div></span>
- <span class="nt"><div</span> <span class="na">class=</span><span class="s">"page-header"</span><span class="nt">></span>
- <span class="nt"><h1></span>Webmasters/Robots META<span class="nt"></h1></span>
- <span class="nt"><p></span>
- If you do not have permission to edit the /robots.txt file on your
- server, you can still tell robots not to index your pages or follow
- your links. The standard mechanism for this is the robots META tag,
- as described at<span class="nt"><a</span> <span class="na">href=</span><span class="s">"http://www.robotstxt.org/wc/meta-user.html"</span><span class="nt">></span>
- http://www.robotstxt.org/wc/meta-user.html<span class="nt"></a></span>.
- <span class="nt"></p></span>
- <span class="nt"></div></span>
- <span class="nt"><div</span> <span class="na">class=</span><span class="s">"page-header"</span><span class="nt">></span>
- <span class="nt"><h1></span>Contact us<span class="nt"></h1></span>
- <span class="nt"><p></span>
- If your site has problems or questions about the Nutch crawler,
- please send an email to the
- <span class="nt"><code></span>agent [at] nutch [dot] apache [dot] org<span class="nt"></code></span>
- - Nutch agent mailing list.
- <span class="nt"></p></span>
- <span class="nt"></div></span>
- <span class="nt"></section></span>
-<span class="nt"></div></span>
-</pre></div></div>
+<header class="jumbotron subhead" id="overview">
+ <div class="container">
+ <h1>Nutch Robot</h1>
+ <p class="lead">A page for SysAdmins/WebMasters and other angry people... ;)</p>
+ </div>
+</header>
+
+<div class="container">
+ <!-- Typography ================================================== -->
+ <section id="application">
+ <div class="page-header">
+ <h1>Introduction</h1>
+ <p>If you're reading this, chances are you've seen a Nutch-based
+ robot visiting your site while looking through your server logs.
+ Our software obeys robots.txt files and robot META tags in HTML.
+ These are the standard mechanisms for webmasters to tell web robots
+ which portions of a site a robot is welcome to access.</p>
+ <h1>Sysadmins/robots.txt</h1>
+ <p>
+ We're a software project, not a service, so please understand that
+ a misbehaving crawler appearing with our Agent string is not run by
+ us. Our software may be run by anyone. However, we'd still like to
+ hear about any bad behavior. If possible, please include the name
+ of the domain and some representative log entries. We can be
+ reached at <code>dev [at] nutch [dot] apache [dot] org</code>.
+ </p>
+ <p>
+ Our software obeys the robots.txt exclusion standard, described at
+ <a href="http://www.robotstxt.org/wc/exclusion.html#robotstxt">
+ http://www.robotstxt.org/wc/exclusion.html#robotstxt</a>. Different
+ installations of the Nutch software may specify different agent
+ names, but all should respond to the agent name "Nutch". Thus to
+ ban all Nutch-based crawlers from your site, place the following in
+ your robots.txt file:</p>
+ <pre>User-agent: Nutch<br>Disallow: /</pre>
+ </div>
+ <div class="page-header">
+ <h1>Webmasters/Robots META</h1>
+ <p>
+ If you do not have permission to edit the /robots.txt file on your
+ server, you can still tell robots not to index your pages or follow
+ your links. The standard mechanism for this is the robots META tag,
+ as described at<a href="http://www.robotstxt.org/wc/meta-user.html">
+ http://www.robotstxt.org/wc/meta-user.html</a>.
+ </p>
+ </div>
+ <div class="page-header">
+ <h1>Contact us</h1>
+ <p>
+ If your site has problems or questions about the Nutch crawler,
+ please send an email to the
+ <code>agent [at] nutch [dot] apache [dot] org</code>
+ - Nutch agent mailing list.
+ </p>
+ </div>
+ </section>
+</div></div>
<!-- /container (main block) -->
<hr>