Posted to dev@nutch.apache.org by Fuad Efendi <fu...@efendi.ca> on 2005/10/03 05:24:43 UTC
RE: BUG - > RobotRulesParser
OC-0.3.2
http://issues.apache.org/jira/browse/NUTCH-84
This patch contains the same bug:
} else if ( (line.length() >= 6)
&& (line.substring(0, 6).equalsIgnoreCase("Allow:")) )
{
- there is no "Allow:" field in the robots exclusion standard...
Also, please test with this multi-record file:
User-agent: ia_archiver
Disallow: /
User-agent: Googlebot-Image
Disallow: /
User-agent: Nutch
Disallow: /
User-agent: TurnitinBot
Disallow: /
-----Original Message-----
From: Fuad Efendi [mailto:fuad@efendi.ca]
Sent: Friday, September 30, 2005 1:23 AM
To: nutch-dev@lucene.apache.org
Cc: 'WebExpertsAmerica'
Subject: BUG - > RobotRulesParser
I noticed this code in the protocol-http and protocol-httpclient plugins:
} else if ( (line.length() >= 6)
&& (line.substring(0, 6).equalsIgnoreCase("Allow:")) )
{
However, according to the original 1994 protocol description, there is
NO "Allow:" field. To allow everything, simply use an empty "Disallow:" value.
http://www.robotstxt.org/wc/norobots.html
Please try testing with www.newegg.com/robots.txt
- their site has this:
User-agent: *
Disallow:
Nutch does not work with Newegg, but it should!
Sorry guys, I don't have enough time to double-check this myself; could you
please verify all of it...
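To make the expected behaviour concrete, here is a minimal sketch of how an empty "Disallow:" value can be read under the 1994 description: it adds no prefix, so nothing is blocked. The class and method names are hypothetical and this is not the Nutch RobotRulesParser code; it also ignores User-agent grouping entirely.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not taken from Nutch.
class DisallowSketch {

    // Collects the non-empty Disallow prefixes from the given lines.
    // An empty "Disallow:" value contributes nothing, i.e. it allows everything.
    static List<String> parseDisallows(String[] lines) {
        List<String> prefixes = new ArrayList<>();
        for (String raw : lines) {
            String line = raw.trim();
            if (line.length() >= 9
                    && line.substring(0, 9).equalsIgnoreCase("Disallow:")) {
                String path = line.substring(9).trim();
                if (!path.isEmpty()) {
                    prefixes.add(path);
                }
            }
        }
        return prefixes;
    }

    // A path is allowed unless it starts with one of the disallowed prefixes.
    static boolean isAllowed(List<String> prefixes, String path) {
        for (String prefix : prefixes) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<String> open = parseDisallows(new String[] {"User-agent: *", "Disallow:"});
        List<String> closed = parseDisallows(new String[] {"User-agent: *", "Disallow: /"});
        System.out.println(isAllowed(open, "/index.html"));   // true
        System.out.println(isAllowed(closed, "/index.html")); // false
    }
}
```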
I noticed a strange discussion on nutch-agent@lucene.apache.org; it seems
that we need to test ......./robots.txt
User-agent: ia_archiver
Disallow: /
User-agent: Googlebot-Image
Disallow: /
User-agent: Nutch
Disallow: /
User-agent: TurnitinBot
Disallow: /
- all of this follows the standard protocol. Could you please retest
whether it works with multiple records? It's a standard!
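As a rough test aid, the sketch below groups Disallow rules by User-agent record and checks the "Nutch" record from the file above. All names are hypothetical and this is not how Nutch implements it. It assumes a record ends at a blank line or when a User-agent line follows a rule line, and it matches agent names by exact case-insensitive comparison, which is a simplification.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration only; not taken from Nutch.
class RobotsRecordsSketch {

    // Maps each lower-cased agent name to its list of Disallow prefixes.
    static Map<String, List<String>> parse(String robotsTxt) {
        Map<String, List<String>> rules = new LinkedHashMap<>();
        List<String> currentAgents = new ArrayList<>();
        boolean sawRule = false;
        for (String raw : robotsTxt.split("\r?\n")) {
            String trimmed = raw.trim();
            if (trimmed.isEmpty()) {
                // A blank line closes the current record.
                currentAgents = new ArrayList<>();
                sawRule = false;
                continue;
            }
            int hash = trimmed.indexOf('#');
            String line = (hash >= 0 ? trimmed.substring(0, hash) : trimmed).trim();
            int colon = line.indexOf(':');
            if (line.isEmpty() || colon < 0) {
                continue; // comment-only or unrecognised line
            }
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                if (sawRule) {
                    // Assumption: a User-agent line after a rule starts a new record.
                    currentAgents = new ArrayList<>();
                    sawRule = false;
                }
                String agent = value.toLowerCase(Locale.ROOT);
                currentAgents.add(agent);
                rules.putIfAbsent(agent, new ArrayList<>());
            } else if (field.equals("disallow")) {
                sawRule = true;
                if (!value.isEmpty()) { // empty value disallows nothing
                    for (String agent : currentAgents) {
                        rules.get(agent).add(value);
                    }
                }
            }
        }
        return rules;
    }

    // Uses the agent's own record if present, otherwise the "*" record.
    static boolean isAllowed(Map<String, List<String>> rules, String agent, String path) {
        List<String> prefixes = rules.get(agent.toLowerCase(Locale.ROOT));
        if (prefixes == null) {
            prefixes = rules.get("*");
        }
        if (prefixes == null) {
            return true; // no applicable record
        }
        for (String prefix : prefixes) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String robotsTxt = "User-agent: ia_archiver\nDisallow: /\n"
                + "User-agent: Googlebot-Image\nDisallow: /\n"
                + "User-agent: Nutch\nDisallow: /\n"
                + "User-agent: TurnitinBot\nDisallow: /\n";
        Map<String, List<String>> rules = parse(robotsTxt);
        System.out.println(isAllowed(rules, "Nutch", "/index.html"));        // false
        System.out.println(isAllowed(rules, "SomeOtherBot", "/index.html")); // true
    }
}
```

The test string is inlined so that nothing is fetched from a live site.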
I see this in the code:
StringTokenizer tok = new StringTokenizer(agentNames, ",");
Comma-separated? That is not an accepted standard yet...
Sorry, WebExpertsAmerica, I really haven't had time to run any
tests...
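For reference, this is what splitting on "," with java.util.StringTokenizer does to a string of names. The helper name and the sample inputs are hypothetical; what agentNames actually holds in the plugin is not shown in this message.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical illustration only; not taken from Nutch.
class AgentTokenizerSketch {

    // Splits on commas and trims surrounding whitespace from each token.
    static List<String> splitAgents(String agentNames) {
        List<String> agents = new ArrayList<>();
        StringTokenizer tok = new StringTokenizer(agentNames, ",");
        while (tok.hasMoreTokens()) {
            agents.add(tok.nextToken().trim());
        }
        return agents;
    }

    public static void main(String[] args) {
        System.out.println(splitAgents("Nutch"));                  // [Nutch]
        System.out.println(splitAgents("Nutch, Googlebot-Image")); // [Nutch, Googlebot-Image]
    }
}
```

A single name with no comma comes back as one token, so the split only matters when a value lists several names on one line.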
Please do not execute tests against production sites.
Thanks!