Posted to dev@nutch.apache.org by Fuad Efendi <fu...@efendi.ca> on 2005/10/03 05:24:43 UTC

RE: BUG - > RobotRulesParser

OC-0.3.2
http://issues.apache.org/jira/browse/NUTCH-84

This patch also contains the same bug:
      } else if ( (line.length() >= 6)
                  && (line.substring(0, 6).equalsIgnoreCase("Allow:")) ) {


- there is no "Allow:" field in the robots exclusion standard...



Also, please test with this multi-record robots.txt:


User-agent: ia_archiver
Disallow: /

User-agent: Googlebot-Image
Disallow: /

User-agent: Nutch
Disallow: /

User-agent: TurnitinBot
Disallow: /    





-----Original Message-----
From: Fuad Efendi [mailto:fuad@efendi.ca] 
Sent: Friday, September 30, 2005 1:23 AM
To: nutch-dev@lucene.apache.org
Cc: 'WebExpertsAmerica'
Subject: BUG - > RobotRulesParser


I noticed this code in protocol-http & protocol-httpclient plugins:

      } else if ( (line.length() >= 6)
                  && (line.substring(0, 6).equalsIgnoreCase("Allow:")) ) {


However, according to the original 1994 protocol description, there is
NO "Allow:" field. To allow everything, simply use "Disallow:" with an
empty value.
http://www.robotstxt.org/wc/norobots.html

Please try testing with www.newegg.com/robots.txt
- their site has this:
User-agent: *
Disallow: 

And Nutch does not work with New Egg, but it should!
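
To illustrate the empty "Disallow:" case, here is a minimal sketch of how
I would expect a parser to treat it. This is not the actual Nutch code;
the class and method names (EmptyDisallowSketch, disallowedPrefixes,
isAllowed) are made up for this example, and it only looks at a single
record.

```java
import java.util.ArrayList;
import java.util.List;

public class EmptyDisallowSketch {

    /** Collects the non-empty Disallow path prefixes from one record's lines. */
    static List<String> disallowedPrefixes(String[] recordLines) {
        List<String> prefixes = new ArrayList<>();
        for (String raw : recordLines) {
            String line = raw.trim();
            // Case-insensitive check for the "Disallow:" field name.
            if (line.regionMatches(true, 0, "Disallow:", 0, 9)) {
                String path = line.substring(9).trim();
                // An empty value restricts nothing, so no prefix is recorded.
                if (!path.isEmpty()) {
                    prefixes.add(path);
                }
            }
        }
        return prefixes;
    }

    /** A path is allowed unless it starts with one of the disallowed prefixes. */
    static boolean isAllowed(List<String> prefixes, String path) {
        for (String prefix : prefixes) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String[] record = { "User-agent: *", "Disallow: " };
        // Prints "true": an empty Disallow value leaves the whole site open.
        System.out.println(isAllowed(disallowedPrefixes(record), "/index.html"));
    }
}
```

With "Disallow: /" instead of the empty value, the same call returns
false for every path.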

Sorry guys, I don't have enough time to double-check; could you please
verify all this...

I noticed a strange discussion at nutch-agent:lucene.apache.org; it seems
that we need to test ......./robots.txt

User-agent: ia_archiver
Disallow: /

User-agent: Googlebot-Image
Disallow: /

User-agent: Nutch
Disallow: /

User-agent: TurnitinBot
Disallow: /    


- everything here follows the standard protocol. Can you please retest
whether it works with multiple records? It's a standard!
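
A rough sketch of the behaviour I would expect for a file like the one
above, assuming records are separated by blank lines, a record may carry
several "User-agent:" lines, and a "*" record applies only when no record
names the robot. The class and method names (MultiRecordSketch,
disallowsFor) are hypothetical, and this is not the RobotRulesParser
implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class MultiRecordSketch {

    /** Returns the Disallow prefixes that apply to the given robot name. */
    static List<String> disallowsFor(String robotsTxt, String agent) {
        String agentLower = agent.toLowerCase(Locale.ROOT);
        List<String> specific = new ArrayList<>();  // rules from records naming us
        List<String> fallback = new ArrayList<>();  // rules from "*" records
        boolean specificSeen = false;   // did any record name our robot?
        boolean namesUs = false;        // current record names our robot
        boolean namesStar = false;      // current record names "*"
        boolean readingAgents = false;  // still in the record's User-agent lines

        for (String raw : robotsTxt.split("\r?\n")) {
            String line = raw.trim();
            if (line.startsWith("#")) {
                continue;               // comment-only lines are ignored
            }
            if (line.isEmpty()) {       // a blank line ends the current record
                namesUs = false;
                namesStar = false;
                readingAgents = false;
                continue;
            }
            if (line.regionMatches(true, 0, "User-agent:", 0, 11)) {
                if (!readingAgents) {   // first agent line starts a new record
                    namesUs = false;
                    namesStar = false;
                    readingAgents = true;
                }
                String name = line.substring(11).trim().toLowerCase(Locale.ROOT);
                if (name.equals("*")) {
                    namesStar = true;
                } else if (!name.isEmpty() && agentLower.contains(name)) {
                    // Case-insensitive substring match on the robot name.
                    namesUs = true;
                    specificSeen = true;
                }
            } else if (line.regionMatches(true, 0, "Disallow:", 0, 9)) {
                readingAgents = false;
                String path = line.substring(9).trim();
                if (path.isEmpty()) {
                    continue;           // empty value restricts nothing
                }
                if (namesUs) {
                    specific.add(path);
                } else if (namesStar) {
                    fallback.add(path);
                }
            }
        }
        return specificSeen ? specific : fallback;
    }

    public static void main(String[] args) {
        String robotsTxt = String.join("\n",
            "User-agent: ia_archiver", "Disallow: /", "",
            "User-agent: Googlebot-Image", "Disallow: /", "",
            "User-agent: Nutch", "Disallow: /", "",
            "User-agent: TurnitinBot", "Disallow: /    ");
        System.out.println(disallowsFor(robotsTxt, "Nutch"));        // [/]
        System.out.println(disallowsFor(robotsTxt, "SomeOtherBot")); // []
    }
}
```

The point of the sketch is only that each record is evaluated separately:
"Nutch" is blocked by the third record, while a robot that is not named
anywhere (and with no "*" record present) has no restrictions.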

I see this in code:
   StringTokenizer tok = new StringTokenizer(agentNames, ",");
 
Comma-separated? That is not an accepted standard yet...

Sorry WebExpertsAmerica, I really didn't have any time to run any
tests...

Please do not execute tests against production sites.
Thanks!