Posted to user@nutch.apache.org by BlackIce <bl...@gmail.com> on 2016/05/24 22:17:21 UTC

Robots.txt

Hi,

I've just seen, on a website that tracks bots, that "Tarantula", our
Nutch 1.11-based crawler, is being classified as not obeying robots.txt.

What's the solution?

Re: Robots.txt

Posted by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov>.
Hi,

By default, as I mentioned, Nutch does obey robots.txt. There is
a whitelist property, defined in nutch-default.xml, that can be set to
selectively disable the robots.txt check for certain sites (again, for
valid security research use cases).
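
To make the whitelist idea concrete, here is a minimal Python sketch. It is
not Nutch's code: the function name may_fetch, the whitelist argument, and
the sample robots.txt are all hypothetical. The corresponding Nutch property
appears to be http.robot.rules.whitelist, but treat that name as an
assumption and check it against your own nutch-default.xml.

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser


def may_fetch(url, agent, robots_txt, whitelist=frozenset()):
    """Return True if `agent` may fetch `url` under the given robots.txt text.

    Hosts listed in `whitelist` (hypothetical parameter) skip the
    robots.txt check entirely; every other host is checked normally.
    """
    host = urlparse(url).hostname
    if host in whitelist:
        return True
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# Hypothetical robots.txt content used only for illustration.
ROBOTS = """\
User-agent: *
Disallow: /private/
"""
```

With an empty whitelist, may_fetch("http://example.com/private/x",
"Tarantula", ROBOTS) returns False. Passing whitelist={"example.com"} makes
the same call return True, which is why such a setting should stay empty
unless the site owner has agreed to it.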

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattmann@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Director, Information Retrieval and Data Science Group (IRDS)
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
WWW: http://irds.usc.edu/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++


Re: Robots.txt

Posted by BlackIce <bl...@gmail.com>.
I don't recall messing with anything to do with robots.txt. I want us to
be as polite as possible.

Re: Robots.txt

Posted by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov>.
Hi,

For security research, there is an option to whitelist specific sites so
that their robots.txt is ignored. It’s not enabled by default and must be
explicitly enabled.

The solution is - there isn’t one. People used to just hack
Nutch to get the same effect, by commenting out the line of code
that performed the robots.txt check.

Those people that are using Nutch and not obeying robots.txt
are doing just that. But Nutch itself by default does obey it.
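
One way to check a crawl against this claim is to replay the fetched URLs
through a robots.txt parser. The sketch below uses Python's standard
urllib.robotparser, not Nutch's own parser, so the two may differ on edge
cases. The function name find_violations, the sample robots.txt, and the
sample URLs are hypothetical.

```python
from urllib.robotparser import RobotFileParser


def find_violations(robots_txt, agent, fetched_urls):
    """Return the fetched URLs that the robots.txt text disallows for `agent`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in fetched_urls if not parser.can_fetch(agent, url)]


# Hypothetical robots.txt with one group aimed at a crawler by name.
SAMPLE_ROBOTS = """\
User-agent: Tarantula
Disallow: /search

User-agent: *
Disallow: /admin/
"""

# Hypothetical URLs, standing in for entries taken from a crawl log.
SAMPLE_FETCHED = [
    "http://example.com/",
    "http://example.com/search/nutch",
    "http://example.com/admin/login",
]
```

Here find_violations(SAMPLE_ROBOTS, "Tarantula", SAMPLE_FETCHED) returns
["http://example.com/search/nutch"]. The rules are matched by agent name, so
the result depends on the name the crawler actually announces: the same call
with a different agent name falls through to the "*" group and reports
"http://example.com/admin/login" instead.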

Cheers,
Chris
