Posted to user@nutch.apache.org by mabi <ma...@protonmail.ch> on 2017/12/10 22:16:32 UTC

robots.txt Disallow not respected

Hello,

I am crawling my website with Nutch 2.3.1 and somehow Nutch does not respect the Disallow rule in my website's robots.txt. I have the following very simple robots.txt file:

User-agent: *
Disallow: /wpblog/feed/

Still, the /wpblog/feed/ URL gets parsed and eventually indexed.

Do I need to enable anything special in the nutch-site.xml config file maybe?

Thanks,
Mabi


Re: robots.txt Disallow not respected

Posted by Chris Mattmann <ma...@apache.org>.
FWIW, in versions of Nutch after 1.10 there is a robots whitelist property that you can use
to explicitly whitelist sites for which robots.txt is ignored.
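
A minimal nutch-site.xml override could look roughly like this (sketch only: the exact property name may differ between releases, so check your nutch-default.xml; in recent 1.x versions it is, if I remember correctly, http.robot.rules.whitelist, and the hostname below is just a placeholder):

  <!-- sketch only: hostnames listed here are crawled without consulting robots.txt -->
  <property>
    <name>http.robot.rules.whitelist</name>
    <value>www.example.com</value>
  </property>

Use it sparingly, of course; it only makes sense for hosts you control.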

Cheers,
Chris




Re: robots.txt Disallow not respected

Posted by Sebastian Nagel <wa...@googlemail.com>.
:)

On 12/12/2017 11:11 PM, mabi wrote:
> Sorry, my bad. I was using a Nutch build from a previous project that I had modified and recompiled to ignore the robots.txt file (as there is no flag to enable/disable that).
> I confirm that the parsing of robots.txt works.


Re: robots.txt Disallow not respected

Posted by mabi <ma...@protonmail.ch>.
Sorry, my bad. I was using a Nutch build from a previous project that I had modified and recompiled to ignore the robots.txt file (as there is no flag to enable/disable that).

I confirm that the parsing of robots.txt works.



Re: robots.txt Disallow not respected

Posted by mabi <ma...@protonmail.ch>.
Hi,

Yes, I tested the robots.txt manually using Nutch's org.apache.nutch.protocol.RobotRulesParser as suggested in the previous mail, and everything works correctly: the URLs which are disallowed come back as "not allowed" and the others as "allowed". So I don't understand why this works but my crawl does not.

I am using https for this website but have a 301 redirect which sends all http traffic to https. I also tried deleting the whole hbase table as well as the solr core, but that did not help either :(
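
For reference, roughly the commands involved, with the table and core names as placeholders (on Nutch 2.x the web table is typically "webpage" or "<crawlId>_webpage"):

% echo -e "disable 'webpage'\ndrop 'webpage'" | hbase shell
% curl "http://localhost:8983/solr/nutch/update?commit=true" -H "Content-Type: text/xml" -d "<delete><query>*:*</query></delete>"

(The second command only empties the index rather than removing the core.)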

Regards,
M.


Re: robots.txt Disallow not respected

Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi,

did you already test whether the robots.txt file is correctly parsed
and rules are applied as expected? See the previous response.

If https or non-default ports are used: is the robots.txt also served
for other protocol/port combinations? See
   https://issues.apache.org/jira/browse/NUTCH-1752

Also note that content is not removed when the robots.txt is changed.
The robots.txt is only applied to a URL which is (re)fetched. To be sure,
delete the web table (stored in HBase, etc.) and restart the crawl.
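
A quick way to check the first point from the command line (hostname is a placeholder):

% curl -sI http://www.example.com/robots.txt | head -n1
% curl -sI https://www.example.com/robots.txt | head -n1

If the http variant only answers with the 301 redirect, that is worth knowing, because the rules are looked up per protocol/host/port combination (see NUTCH-1752 above).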

Best,
Sebastian


Re: robots.txt Disallow not respected

Posted by mabi <ma...@protonmail.ch>.
Hi Sebastian,

I am already using the protocol-httpclient plugin, as I also require HTTPS. I checked the access.log of the website I am crawling and can see that Nutch did a GET on the robots.txt, as you can see here:

123.123.123.123 - - [10/Dec/2017:23:09:29 +0100] "GET /robots.txt HTTP/1.0" 200 223 "-" "MyCrawler/0.1"

What I also did is enable DEBUG logging in log4j.properties, like this:

log4j.logger.org.apache.nutch=DEBUG

and grep for "robots" in the hadoop.log file, but nothing could be found there either, no errors, nothing.
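
Spelled out as commands, assuming the default local runtime layout (conf/ and logs/ under the Nutch runtime directory):

% echo "log4j.logger.org.apache.nutch=DEBUG" >> conf/log4j.properties
% grep -i robots logs/hadoop.log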

What else could I try or check?

Best,
M.


Re: robots.txt Disallow not respected

Posted by Zoltán Zvara <zo...@gmail.com>.
Hi,

Check that the robots.txt is fetched and parsed correctly. Try changing the protocol plugin to protocol-httpclient.
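
Switching the protocol plugin is done via the plugin.includes property in nutch-site.xml. A rough sketch, with the plugin list only as an example (keep the rest of the value in line with whatever your current configuration uses):

  <!-- sketch only: protocol-http swapped for protocol-httpclient,
       the other plugins should mirror your existing setup -->
  <property>
    <name>plugin.includes</name>
    <value>protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
  </property>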

Z

Re: robots.txt Disallow not respected

Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi,

I've tried to reproduce it. But it works as expected:

% cat robots.txt
User-agent: *
Disallow: /wpblog/feed/

% cat test.txt
http://www.example.com/wpblog/feed/
http://www.example.com/wpblog/feed/index.html

% $NUTCH_HOME/bin/nutch org.apache.nutch.protocol.RobotRulesParser robots.txt test.txt 'myAgent'
not allowed:    http://www.example.com/wpblog/feed/
not allowed:    http://www.example.com/wpblog/feed/index.html


There are no steps required to make Nutch respect the robots.txt rules.
Only the robots.txt must be properly placed and readable.

Best,
Sebastian

