Posted to user@nutch.apache.org by Tejas Patil <te...@gmail.com> on 2013/01/04 04:39:35 UTC

Robots.txt for Ftp

Hi,

As per [0], an FTP site can have a robots.txt file like [1]. In the Nutch code,
the Ftp plugin does not parse the robots file and simply accepts any URL.

In "src/plugin/protocol-ftp/src/java/org/apache/nutch/protocol/ftp/Ftp.java"

  public RobotRules getRobotRules(Text url, CrawlDatum datum) {
    return EmptyRobotRules.RULES;
  }

Was this done intentionally, or is this a bug?
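
For illustration, a robots-aware version might look something like the rough
sketch below. To be clear, fetchFile(), parseRules(), the per-host cache, and
agentNames are all hypothetical placeholders I made up for this sketch, not
existing Nutch API; it also assumes java.net.URL is imported in Ftp.java:

  public RobotRules getRobotRules(Text url, CrawlDatum datum) {
    try {
      URL u = new URL(url.toString());
      String host = u.getHost();

      // hypothetical per-host cache (e.g. a Map<String, RobotRules> field),
      // so robots.txt is fetched at most once per server
      RobotRules cached = cache.get(host);
      if (cached != null) {
        return cached;
      }

      // per [0], robots.txt sits at the root of the server, as in [1];
      // fetchFile() stands in for retrieving the file over FTP
      byte[] content = fetchFile(new URL("ftp://" + host + "/robots.txt"));

      RobotRules rules = (content == null)
          ? EmptyRobotRules.RULES   // no robots.txt present: allow everything
          : parseRules(content, agentNames);

      cache.put(host, rules);
      return rules;
    } catch (Exception e) {
      // on any failure, fall back to the current behavior and accept the URL
      return EmptyRobotRules.RULES;
    }
  }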

[0] :
https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt
[1] : ftp://example.com/robots.txt

Thanks,
Tejas Patil

Re: Robots.txt for Ftp

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi,

I don't know if it is a bug; however, your suggested improvement would
without a doubt be welcome.

If you could please log a Jira issue, we can review it.

Best

Lewis

On Fri, Jan 4, 2013 at 3:39 AM, Tejas Patil <te...@gmail.com> wrote:

> Hi,
>
> As per [0], an FTP site can have a robots.txt file like [1]. In the Nutch
> code, the Ftp plugin does not parse the robots file and simply accepts any
> URL.
>
> In
> "src/plugin/protocol-ftp/src/java/org/apache/nutch/protocol/ftp/Ftp.java"
>
>   public RobotRules getRobotRules(Text url, CrawlDatum datum) {
>     return EmptyRobotRules.RULES;
>   }
>
> Was this done intentionally, or is this a bug?
>
> [0] :
>
> https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt
> [1] : ftp://example.com/robots.txt
>
> Thanks,
> Tejas Patil
>



-- 
Lewis