Posted to dev@nutch.apache.org by "Doug Cutting (JIRA)" <ji...@apache.org> on 2005/11/05 04:39:19 UTC

[jira] Created: (NUTCH-124) protocol-httpclient does not follow redirects when fetching robots.txt

protocol-httpclient does not follow redirects when fetching robots.txt
----------------------------------------------------------------------

         Key: NUTCH-124
         URL: http://issues.apache.org/jira/browse/NUTCH-124
     Project: Nutch
        Type: Bug
  Components: fetcher  
    Versions: 0.8-dev, 0.7.2-dev    
    Reporter: Doug Cutting


If a site's robots.txt redirects, protocol-httpclient does not correctly fetch the robots.txt and effectively ignores it for the site.  See http://www.webmasterworld.com/forum11/3008.htm.
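
To make the expected behavior concrete, here is a minimal sketch of a robots.txt fetch that follows redirects but stops after a fixed number of hops. It is not the code in protocol-httpclient: the `RobotsRedirectSketch` class, the `Response` type, and the `Transport` hook are hypothetical names introduced only for illustration, and the set of status codes treated as redirects is an assumption.

```java
import java.net.URI;

public class RobotsRedirectSketch {

    /** Minimal response model (hypothetical): status, optional Location, body. */
    static final class Response {
        final int status;
        final String location; // null unless the response is a redirect
        final String body;

        Response(int status, String location, String body) {
            this.status = status;
            this.location = location;
            this.body = body;
        }
    }

    /** Hypothetical transport hook so the loop can be exercised without a network. */
    interface Transport {
        Response get(URI uri);
    }

    // Assumption: these are the status codes treated as redirects.
    static boolean isRedirect(int status) {
        return status == 301 || status == 302 || status == 303 || status == 307;
    }

    /**
     * Fetches robots.txt starting at start, following at most maxRedirects
     * redirects. Returns the body on a 200, or null when the file is missing,
     * the redirect has no Location, or the hop limit is reached.
     */
    static String fetchRobots(URI start, Transport transport, int maxRedirects) {
        URI current = start;
        for (int hops = 0; hops <= maxRedirects; hops++) {
            Response r = transport.get(current);
            if (r == null) {
                return null;
            }
            if (r.status == 200) {
                return r.body;
            }
            if (!isRedirect(r.status) || r.location == null) {
                return null;
            }
            // resolve() also handles relative Location values
            current = current.resolve(r.location);
        }
        return null; // too many redirects: give up instead of looping
    }
}
```

Returning null on the hop limit leaves the policy decision (treat the site as unrestricted, or skip it) to the caller; which of the two is right for Nutch is not something this sketch settles.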

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira


[jira] Commented: (NUTCH-124) protocol-httpclient does not follow redirects when fetching robots.txt

Posted by "Fuad Efendi (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-124?page=comments#action_12357186 ] 

Fuad Efendi commented on NUTCH-124:
-----------------------------------

Is such behavior defined in the Robots Exclusion Protocol (http://www.robotstxt.org/)? If so, it should be some kind of new field in the robots.txt of the source site, such as:
Redirect-Disallow: Nutch

Compare this with how Nutch behaves when one site links to a page on a second site and the second site has a "Disallow" for that page. Nutch handles that case correctly: it uses the robots.txt file from the same site as the web page.

Robots.txt MUST NOT define behavior for foreign sites.
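
As a rough illustration of the "same site" rule, the sketch below derives the robots.txt location from the page's own scheme, host, and port, and from nothing else. The class and method names are hypothetical, and this is not taken from the Nutch source.

```java
import java.net.URI;
import java.net.URISyntaxException;

public class RobotsUrlSketch {

    /**
     * Returns the robots.txt URL that governs the given page: same scheme,
     * host, and port as the page, with the path replaced by /robots.txt.
     */
    static URI robotsUrlFor(URI page) {
        try {
            // Query and fragment are dropped; a port of -1 (unset) is omitted.
            return new URI(page.getScheme(), null, page.getHost(),
                           page.getPort(), "/robots.txt", null, null);
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Cannot build robots.txt URL for " + page, e);
        }
    }
}
```

A link from site A to a page on site B is therefore checked against B's robots.txt, regardless of what A's file says.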


> protocol-httpclient does not follow redirects when fetching robots.txt
> ----------------------------------------------------------------------
>
>          Key: NUTCH-124
>          URL: http://issues.apache.org/jira/browse/NUTCH-124
>      Project: Nutch
>         Type: Bug
>   Components: fetcher
>     Versions: 0.8-dev, 0.7.2-dev
>     Reporter: Doug Cutting
>      Fix For: 0.8-dev

>
> If a site's robots.txt redirects, protocol-httpclient does not correctly fetch the robots.txt and effectively ignores it for the site.  See http://www.webmasterworld.com/forum11/3008.htm.



Re: [Nutch-dev] [jira] Resolved: (NUTCH-124) protocol-httpclient does not follow redirects when fetching robots.txt

Posted by Doug Cutting <cu...@nutch.org>.
Massimo Miccoli wrote:
> There's a problem with that solution.  For some sites, protocol-httpclient
> now generates a SEVERE "Narrowly avoided an infinite loop in execute" message.
> The fetcher then exits, and only the pages fetched before the SEVERE
> message are retrieved.
> I don't know of a solution; for now I have switched back to protocol-http.

Can you provide more details?

Thanks,

Doug

Re: [Nutch-dev] [jira] Resolved: (NUTCH-124) protocol-httpclient does not follow redirects when fetching robots.txt

Posted by Massimo Miccoli <mm...@iltrovatore.it>.
There's a problem with that solution.  For some sites, protocol-httpclient
now generates a SEVERE "Narrowly avoided an infinite loop in execute" message.
The fetcher then exits, and only the pages fetched before the SEVERE
message are retrieved.
I don't know of a solution; for now I have switched back to protocol-http.



Doug Cutting (JIRA) wrote:

>     [ http://issues.apache.org/jira/browse/NUTCH-124?page=all ]
>
> Doug Cutting resolved NUTCH-124:
> --------------------------------
>
>     Fix Version: 0.8-dev
>      Resolution: Fixed
>
> I have fixed this in the mapred branch.
>
>> protocol-httpclient does not follow redirects when fetching robots.txt
>> ----------------------------------------------------------------------
>>
>>          Key: NUTCH-124
>>          URL: http://issues.apache.org/jira/browse/NUTCH-124
>>      Project: Nutch
>>         Type: Bug
>>   Components: fetcher
>>     Versions: 0.8-dev, 0.7.2-dev
>>     Reporter: Doug Cutting
>>      Fix For: 0.8-dev
>>
>> If a site's robots.txt redirects, protocol-httpclient does not correctly fetch the robots.txt and effectively ignores it for the site.  See http://www.webmasterworld.com/forum11/3008.htm.

[jira] Resolved: (NUTCH-124) protocol-httpclient does not follow redirects when fetching robots.txt

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/NUTCH-124?page=all ]
     
Doug Cutting resolved NUTCH-124:
--------------------------------

    Fix Version: 0.8-dev
     Resolution: Fixed

I have fixed this in the mapred branch.

> protocol-httpclient does not follow redirects when fetching robots.txt
> ----------------------------------------------------------------------
>
>          Key: NUTCH-124
>          URL: http://issues.apache.org/jira/browse/NUTCH-124
>      Project: Nutch
>         Type: Bug
>   Components: fetcher
>     Versions: 0.8-dev, 0.7.2-dev
>     Reporter: Doug Cutting
>      Fix For: 0.8-dev

>
> If a site's robots.txt redirects, protocol-httpclient does not correctly fetch the robots.txt and effectively ignores it for the site.  See http://www.webmasterworld.com/forum11/3008.htm.
