Posted to dev@nutch.apache.org by "Dennis Kubes (JIRA)" <ji...@apache.org> on 2007/02/18 08:57:07 UTC

[jira] Updated: (NUTCH-247) robot parser to restrict.

     [ https://issues.apache.org/jira/browse/NUTCH-247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dennis Kubes updated NUTCH-247:
-------------------------------

    Attachment: agent-names.patch

This patch removes the agent name and robot agents checks, along with their severe-level logging, from RobotRulesParser and moves the checks to the start of the fetcher job.  Now, if http.agent.name is null or blank, or if it is not the first agent listed in the http.robots.agents property, an IllegalArgumentException is thrown and logged to the user, and processing stops.  The patch also updates the crawl-tests.xml test configuration used by the TestFetcher unit tests.  All core unit tests pass, and the patch has been run successfully against a fetch cycle.
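
For illustration only, the up-front check described above could look roughly like the sketch below. It is not the code in agent-names.patch: the class and method names are hypothetical, the two property values are passed in as plain strings instead of being read from the Nutch configuration, and it assumes that http.robots.agents is a comma-separated list and that an empty list counts as "not listed first".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical helper; names are illustrative and not taken from the patch.
final class AgentNameCheck {

    /**
     * Fails fast if the agent settings are inconsistent.
     *
     * @param agentName    value of http.agent.name (may be null)
     * @param robotsAgents value of http.robots.agents, assumed comma-separated
     */
    static void checkAgentNames(String agentName, String robotsAgents) {
        if (agentName == null || agentName.trim().isEmpty()) {
            throw new IllegalArgumentException(
                "No agent name set in 'http.agent.name' property.");
        }
        String name = agentName.trim();

        // Split the advertised agents, dropping empty entries.
        List<String> agents = new ArrayList<>();
        if (robotsAgents != null) {
            StringTokenizer tok = new StringTokenizer(robotsAgents, ",");
            while (tok.hasMoreTokens()) {
                String agent = tok.nextToken().trim();
                if (!agent.isEmpty()) {
                    agents.add(agent);
                }
            }
        }

        // Assumption: an empty list is treated the same as a wrong first entry.
        if (agents.isEmpty() || !agents.get(0).equalsIgnoreCase(name)) {
            throw new IllegalArgumentException(
                "Agent we advertise (" + name
                + ") not listed first in 'http.robots.agents' property.");
        }
    }

    public static void main(String[] args) {
        checkAgentNames("MyCrawler", "MyCrawler,*");   // accepted
        try {
            checkAgentNames("MyCrawler", "*,MyCrawler");
        } catch (IllegalArgumentException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
    }
}
```

Running the check once before the job starts means a misconfiguration stops the fetch immediately, rather than being repaired silently in the parser and then aborting the fetcher threads later.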

> robot parser to restrict.
> -------------------------
>
>                 Key: NUTCH-247
>                 URL: https://issues.apache.org/jira/browse/NUTCH-247
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 0.8
>            Reporter: Stefan Groschupf
>         Assigned To: Dennis Kubes
>            Priority: Minor
>             Fix For: 0.9.0
>
>         Attachments: agent-names.patch
>
>
> If the agent name and the robots agents are not properly configured, the robot rules parser uses LOG.severe to log the problem, but it also fixes the problem itself.
> Later, the fetcher thread checks for severe errors and stops if one has been logged.
> RobotRulesParser:
>     if (agents.size() == 0) {
>       agents.add(agentName);
>       LOG.severe("No agents listed in 'http.robots.agents' property!");
>     } else if (!((String)agents.get(0)).equalsIgnoreCase(agentName)) {
>       agents.add(0, agentName);
>       LOG.severe("Agent we advertise (" + agentName
>                  + ") not listed first in 'http.robots.agents' property!");
>     }
> Fetcher.FetcherThread:
>     if (LogFormatter.hasLoggedSevere())     // something bad happened
>       break;
> I suggest using warn or something similar instead of severe to log this problem.
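
The interaction described in the issue can be reproduced in miniature with the sketch below. It uses java.util.logging as a stand-in and a hypothetical SevereTracker handler in place of LogFormatter, whose internals are not shown here, so it should be read as a model of the behaviour and not as Nutch code. The same configuration message is logged once at a chosen level, and a loop then stops as soon as a severe record has been seen.

```java
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

// Hypothetical model of the "has logged severe" check; not Nutch code.
final class SevereFlagDemo {

    /** Remembers whether any SEVERE (or higher) record was published. */
    static final class SevereTracker extends Handler {
        private volatile boolean loggedSevere = false;

        @Override
        public void publish(LogRecord record) {
            if (record.getLevel().intValue() >= Level.SEVERE.intValue()) {
                loggedSevere = true;
            }
        }

        @Override
        public void flush() { }

        @Override
        public void close() { }

        boolean hasLoggedSevere() {
            return loggedSevere;
        }
    }

    /**
     * Logs the configuration problem at the given level, then "fetches" up
     * to urlCount items, stopping once a severe record has been seen.
     * Returns the number of items processed.
     */
    static int run(Level configProblemLevel, int urlCount) {
        Logger log = Logger.getAnonymousLogger();
        log.setUseParentHandlers(false);   // keep the demo output quiet
        SevereTracker tracker = new SevereTracker();
        log.addHandler(tracker);

        // The parser has already repaired the agent list at this point,
        // yet the level used for the message decides what happens next.
        log.log(configProblemLevel,
            "Agent we advertise not listed first in 'http.robots.agents' property!");

        int fetched = 0;
        for (int i = 0; i < urlCount; i++) {
            if (tracker.hasLoggedSevere()) {   // something bad happened
                break;
            }
            fetched++;
        }
        return fetched;
    }

    public static void main(String[] args) {
        System.out.println("severe:  fetched " + run(Level.SEVERE, 5));
        System.out.println("warning: fetched " + run(Level.WARNING, 5));
    }
}
```

With the message logged at severe level the loop processes nothing, while at warning level all items are processed, which is the difference the reporter is pointing at.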
