Posted to dev@nutch.apache.org by Doğacan Güney <do...@agmlab.com> on 2007/02/15 12:07:44 UTC

lib-http crawl-delay problem

Hi,

There seem to be two small bugs in lib-http's RobotRulesParser.

The first is about reading crawl-delay. The code doesn't check addRules,
so the Nutch bot will pick up the crawl-delay value specified for another
robot in robots.txt. Let me try to be more clear:

User-agent: foobot
Crawl-delay: 3600

User-agent: *
Disallow:


In such a robots.txt file, the Nutch bot will get 3600 as its crawl-delay
value, no matter what its name actually is.
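
To make the intended behavior concrete, here is a minimal, self-contained
sketch (it is not the attached patch, and the class and method names are
made up for illustration) of honoring Crawl-delay only for a User-agent
block that matches our robot, which is what the missing addRules check
amounts to:

    import java.io.BufferedReader;
    import java.io.StringReader;

    /** Minimal sketch, not Nutch code: honor Crawl-delay only for matching agents. */
    public class CrawlDelaySketch {
      public static void main(String[] args) throws Exception {
        String robotsTxt = "User-agent: foobot\n"
                         + "Crawl-delay: 3600\n"
                         + "\n"
                         + "User-agent: *\n"
                         + "Disallow:\n";
        System.out.println(crawlDelayFor("nutchbot", robotsTxt)); // -1: no delay set for us
        System.out.println(crawlDelayFor("foobot", robotsTxt));   // 3600
      }

      static long crawlDelayFor(String agent, String robotsTxt) throws Exception {
        BufferedReader in = new BufferedReader(new StringReader(robotsTxt));
        boolean addRules = false; // does the current User-agent block apply to us?
        long crawlDelay = -1;
        String line;
        while ((line = in.readLine()) != null) {
          line = line.trim();
          if (line.regionMatches(true, 0, "User-agent:", 0, 11)) {
            String name = line.substring(11).trim().toLowerCase();
            addRules = name.equals("*") || agent.toLowerCase().indexOf(name) != -1;
          } else if (line.regionMatches(true, 0, "Crawl-delay:", 0, 12)) {
            if (addRules) { // this is the check the current parser is missing
              crawlDelay = Long.parseLong(line.substring(12).trim());
            }
          }
        }
        return crawlDelay;
      }
    }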

The second is about the main method. RobotRulesParser.main advertises its
usage as "<robots-file> <url-file> <agent-name>+", but if you give it more
than one agent name it refuses to run.
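
If I read the usage string right, "<agent-name>+" means one or more agent
names, so presumably the fix is just relaxing the argument-count check,
roughly like this (illustrative only, argument names assumed; the attached
patch has the actual change):

    // Illustrative guard: accept at least three arguments instead of exactly three.
    if (argv.length < 3) {
      System.err.println("Usage: RobotRulesParser <robots-file> <url-file> <agent-name>+");
      System.exit(-1);
    }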

Trivial patch attached.

--
Doğacan Güney

Re: lib-http crawl-delay problem

Posted by rubdabadub <ru...@gmail.com>.
Thanks for the link!



On 2/15/07, Doğacan Güney <do...@agmlab.com> wrote:
> rubdabadub wrote:
> > Hi:
> >
> > I am unable to get the attached patch via mail. It's better if you
> > create a JIRA issue and attach the patch there.
> >
> > Thank you.
> >
>
> I don't know, this bug seems too minor to require its own JIRA issue.
> So I put the patch at
> http://www.ceng.metu.edu.tr/~e1345172/crawl-delay.patch
>
>

Re: lib-http crawl-delay problem

Posted by Doğacan Güney <do...@agmlab.com>.
rubdabadub wrote:
> Hi:
>
> I am unable to get the attached patch via mail. It's better if you
> create a JIRA issue and attach the patch there.
>
> Thank you.
>

I don't know, this bug seems too minor to require its own JIRA issue.
So I put the patch at
http://www.ceng.metu.edu.tr/~e1345172/crawl-delay.patch 


Re: lib-http crawl-delay problem

Posted by rubdabadub <ru...@gmail.com>.
Hi:

I am unable to get the attached patch via mail. It's better if you
create a JIRA issue and attach the patch there.

Thank you.

On 2/15/07, Doğacan Güney <do...@agmlab.com> wrote:
> Hi,
>
> There seem to be two small bugs in lib-http's RobotRulesParser.
>
> The first is about reading crawl-delay. The code doesn't check addRules,
> so the Nutch bot will pick up the crawl-delay value specified for another
> robot in robots.txt. Let me try to be more clear:
>
> User-agent: foobot
> Crawl-delay: 3600
>
> User-agent: *
> Disallow:
>
>
> In such a robots.txt file, the Nutch bot will get 3600 as its crawl-delay
> value, no matter what its name actually is.
>
> The second is about the main method. RobotRulesParser.main advertises its
> usage as "<robots-file> <url-file> <agent-name>+", but if you give it more
> than one agent name it refuses to run.
>
> Trivial patch attached.
>
> --
> Doğacan Güney
>
>