Posted to dev@nutch.apache.org by "Ken Krugler (JIRA)" <ji...@apache.org> on 2011/06/17 20:12:47 UTC

[jira] [Created] (NUTCH-1008) Switch to crawler-commons version of robots.txt parsing code

Switch to crawler-commons version of robots.txt parsing code
------------------------------------------------------------

                 Key: NUTCH-1008
                 URL: https://issues.apache.org/jira/browse/NUTCH-1008
             Project: Nutch
          Issue Type: Improvement
    Affects Versions: 1.4
            Reporter: Ken Krugler
            Priority: Minor


The Bixo project has an improved version of Nutch's robots.txt parsing code.

This was recently contributed to crawler-commons, in a format that should be independent of Bixo, Cascading, and even Hadoop.

Nutch could switch to this, and benefit from more robust parsing, better compliance with ad hoc extensions to the robot exclusion protocol, and a wider community of users/developers for that code.
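
For illustration only, a minimal sketch of how Nutch might delegate robots.txt handling to crawler-commons. This assumes the SimpleRobotRulesParser / BaseRobotRules classes in the crawlercommons.robots package; the exact parseContent signature has varied between crawler-commons releases (older releases take the agent names as a single String, newer ones take a collection), so treat this as an assumption rather than the final integration:

    import crawlercommons.robots.BaseRobotRules;
    import crawlercommons.robots.SimpleRobotRulesParser;

    public class RobotsCheck {
        public static void main(String[] args) throws Exception {
            // Body of a previously fetched robots.txt (here read from a local file for the example)
            byte[] content = java.nio.file.Files.readAllBytes(
                    java.nio.file.Paths.get("robots.txt"));

            SimpleRobotRulesParser parser = new SimpleRobotRulesParser();
            BaseRobotRules rules = parser.parseContent(
                    "http://www.example.com/robots.txt",   // URL the rules were fetched from
                    content,
                    "text/plain",                           // content type of the response
                    "nutch");                               // robot name(s) to match against User-agent lines

            // Check whether a URL may be fetched, and read any Crawl-delay the parser picked up
            System.out.println(rules.isAllowed("http://www.example.com/some/page"));
            System.out.println("Crawl-delay: " + rules.getCrawlDelay());
        }
    }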

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Closed] (NUTCH-1008) Switch to crawler-commons version of robots.txt parsing code

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma closed NUTCH-1008.
--------------------------------

    Resolution: Duplicate

Closed in favor of NUTCH-1031.
                
> Switch to crawler-commons version of robots.txt parsing code
> ------------------------------------------------------------
>
>                 Key: NUTCH-1008
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1008
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 1.4
>            Reporter: Ken Krugler
>            Priority: Minor
>
> The Bixo project has an improved version of Nutch's robots.txt parsing code.
> This was recently contributed to crawler-commons, in a format that should be independent of Bixo, Cascading, and even Hadoop.
> Nutch could switch to this, and benefit from more robust parsing, better compliance with ad hoc extensions to the robot exclusion protocol, and a wider community of users/developers for that code.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira