Posted to dev@nutch.apache.org by "Tejas Patil (JIRA)" <ji...@apache.org> on 2013/05/04 08:42:16 UTC

[jira] [Commented] (NUTCH-1513) Support Robots.txt for Ftp urls

    [ https://issues.apache.org/jira/browse/NUTCH-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13649026#comment-13649026 ] 

Tejas Patil commented on NUTCH-1513:
------------------------------------

One thing I forgot to mention: the change picks up the agent names from http.agent.name and http.robots.agents. I could have added new configs such as ftp.agent.name, but I don't see the point: they would generally carry the same values as the HTTP ones, so creating them would only add to the already large set of existing configs. What do you think?
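
For illustration, here is a minimal sketch of how the agent names could be collected from those two properties. It is not the code from the attached patch: the class AgentNames, the method resolveAgentNames, and the plain Map standing in for the Nutch configuration are all hypothetical, and it assumes http.robots.agents holds a comma-separated list.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of Nutch: a plain Map stands in for the crawler configuration.
class AgentNames {

    // Collects agent names in precedence order: http.agent.name first,
    // then each entry of http.robots.agents (assumed comma-separated).
    static List<String> resolveAgentNames(Map<String, String> conf) {
        Set<String> names = new LinkedHashSet<>(); // keeps order, drops duplicates

        String primary = conf.get("http.agent.name");
        if (primary != null && !primary.trim().isEmpty()) {
            names.add(primary.trim().toLowerCase(Locale.ROOT));
        }

        String extra = conf.get("http.robots.agents");
        if (extra != null) {
            for (String token : extra.split(",")) {
                String name = token.trim().toLowerCase(Locale.ROOT);
                if (!name.isEmpty()) {
                    names.add(name);
                }
            }
        }
        return new ArrayList<>(names);
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put("http.agent.name", "MyCrawler");     // example value, an assumption
        conf.put("http.robots.agents", "MyCrawler, *"); // example value, an assumption
        System.out.println(resolveAgentNames(conf));
    }
}
```

Because the FTP side reads the same two keys, a crawl configured for HTTP would need no extra settings for FTP under this approach.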
                
> Support Robots.txt for Ftp urls
> -------------------------------
>
>                 Key: NUTCH-1513
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1513
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 1.7, 2.2
>            Reporter: Tejas Patil
>            Assignee: Tejas Patil
>            Priority: Minor
>              Labels: robots.txt
>             Fix For: 2.3, 1.8
>
>         Attachments: NUTCH-1513.trunk.patch
>
>
> As per [0], an FTP site can have a robots.txt file like [1]. In the Nutch code, the Ftp plugin does not parse the robots file and accepts all URLs.
> In "_src/plugin/protocol-ftp/src/java/org/apache/nutch/protocol/ftp/Ftp.java_"
> {noformat}   public RobotRules getRobotRules(Text url, CrawlDatum datum) {
>     return EmptyRobotRules.RULES;
>   }{noformat} 
> It's not clear whether this was part of the design or is a bug. 
> [0] : https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt
> [1] : ftp://example.com/robots.txt
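
As a rough illustration of the convention described in [0] and [1], the sketch below derives the robots.txt location for an arbitrary FTP URL by keeping the host (and any explicit port) and replacing the path with /robots.txt. The class FtpRobotsUrl and the method robotsUrlFor are hypothetical names chosen for this example; this is not the code from the attached patch.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper, not part of Nutch.
class FtpRobotsUrl {

    // Returns the robots.txt URL that would govern the given FTP URL.
    static String robotsUrlFor(String url) {
        URI uri = URI.create(url); // throws IllegalArgumentException on malformed input
        if (!"ftp".equalsIgnoreCase(uri.getScheme()) || uri.getHost() == null) {
            throw new IllegalArgumentException("Not an FTP URL with a host: " + url);
        }
        try {
            // Keep host and explicit port; drop user info, path, query and fragment.
            URI robots = new URI("ftp", null, uri.getHost(), uri.getPort(),
                                 "/robots.txt", null, null);
            return robots.toString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(robotsUrlFor("ftp://example.com/pub/data/file.txt"));
    }
}
```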

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira