Posted to dev@nutch.apache.org by "Tejas Patil (JIRA)" <ji...@apache.org> on 2014/01/28 20:38:45 UTC

[jira] [Updated] (NUTCH-1718) update description of property http.robots.agent

     [ https://issues.apache.org/jira/browse/NUTCH-1718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tejas Patil updated NUTCH-1718:
-------------------------------

    Attachment: NUTCH-1718-trunk.v1.patch

Thanks [~wastl-nagel] for bringing this up. I should have updated the documentation as part of NUTCH-1715 but lost track of it.

In addition to updating the documentation, I am proposing this:
Instead of requiring users to list 'http.agent.name' as the first agent in 'http.robots.agents', make the program do that automatically. Users would then use 'http.robots.agents' only to specify additional agents beyond 'http.agent.name'. A patch for this is attached.
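
For illustration only, the proposed behavior could look roughly like the sketch below. This is not the code in the attached patch. The class name RobotsAgentList and the method buildAgentList are hypothetical, and the two arguments stand in for the values of 'http.agent.name' and 'http.robots.agents' as plain strings rather than being read from the Nutch configuration. Dropping duplicates and blank entries is also an assumption.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class RobotsAgentList {

  /**
   * Hypothetical helper: returns the agent names to match against robots.txt,
   * with the value of http.agent.name always first, followed by any
   * additional agents given in http.robots.agents (comma-separated).
   */
  public static List<String> buildAgentList(String agentName, String additionalAgents) {
    if (agentName == null || agentName.trim().isEmpty()) {
      throw new IllegalArgumentException("http.agent.name must be set");
    }
    // LinkedHashSet keeps insertion order and silently drops duplicates,
    // so repeating the agent name in the additional list is harmless.
    Set<String> agents = new LinkedHashSet<>();
    agents.add(agentName.trim());
    if (additionalAgents != null) {
      for (String agent : additionalAgents.split(",")) {
        String trimmed = agent.trim();
        if (!trimmed.isEmpty()) {
          agents.add(trimmed);
        }
      }
    }
    return new ArrayList<>(agents);
  }

  public static void main(String[] args) {
    // prints [Blurfl, BlurflDev]
    System.out.println(buildAgentList("Blurfl", "BlurflDev, Blurfl"));
  }
}
```

With this approach a user who sets only 'http.agent.name' gets a one-element list, and 'http.robots.agents' can be left empty.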

> update description of property http.robots.agent
> ------------------------------------------------
>
>                 Key: NUTCH-1718
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1718
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 1.7, 2.2, 2.2.1
>            Reporter: Sebastian Nagel
>            Priority: Trivial
>             Fix For: 2.3, 1.8
>
>         Attachments: NUTCH-1718-trunk.v1.patch
>
>
> The description of property http.robots.agents in nutch-default.xml recommends adding a '*' to the list of agent names. This will cause the same problem as described in NUTCH-1715. The description should be updated, including the part about "order of precedence", which since NUTCH-1031 is dictated only by the ordering of user agents in robots.txt.
> {code:xml}
> <property>
>   <name>http.robots.agents</name>
>   <value>*</value>
>   <description>The agent strings we'll look for in robots.txt files,
>   comma-separated, in decreasing order of precedence. You should
>   put the value of http.agent.name as the first agent name, and keep the
>   default * at the end of the list. E.g.: BlurflDev,Blurfl,*
>   </description>
> </property>
> {code}



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)