Posted to dev@lucene.apache.org by "Ryan Stokes (JIRA)" <ji...@apache.org> on 2018/02/23 17:54:00 UTC
[jira] [Created] (SOLR-12026) SimplePostTool with robots.txt
Ryan Stokes created SOLR-12026:
----------------------------------
Summary: SimplePostTool with robots.txt
Key: SOLR-12026
URL: https://issues.apache.org/jira/browse/SOLR-12026
Project: Solr
Issue Type: Bug
Security Level: Public (Default Security Level. Issues are Public)
Components: SimplePostTool
Affects Versions: 7.2
Reporter: Ryan Stokes
[First issue here, apologies in advance for missteps.]
Three things which could improve working with robots.txt:
# When fetching the robots.txt for a URL, the port is ignored, so the request defaults to :80. If nothing is listening on :80, the page is fetched anyway. isDisallowedByRobots() could include url.getPort() when constructing strRobot. This helps when testing your robots.txt on a non-standard port, such as during development.
# Disallow directives are applied regardless of User-agent. parseRobotsTxt() could let a record which specifies SimplePostTool-crawler override a Disallow. This would help when indexing your own site, which you've explicitly opened to indexing by SimplePostTool. I don't know if that's a good practice, but it would help in testing.
# The User-agent header sent when fetching robots.txt is not "SimplePostTool-crawler" but "Java/<version>". The code in readPageFromUrl() which sets the header correctly could be reused in isDisallowedByRobots().
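
To make the three suggestions concrete, here is a rough sketch of what each change might look like. It is not the SimplePostTool source: the class RobotsSketch and the methods robotsUrlFor(), parseDisallows() and openAsCrawler() are hypothetical names used only for illustration, and the agent token is taken from the text above. The "specific record wins over *" rule is one possible reading of the second item, and Allow lines are ignored here.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper class, not part of Solr.
final class RobotsSketch {
  static final String AGENT = "SimplePostTool-crawler"; // token as quoted in the issue

  // Item 1: keep an explicit port when building the robots.txt URL.
  static String robotsUrlFor(URL url) {
    int port = url.getPort(); // -1 when the URL carries no explicit port
    return url.getProtocol() + "://" + url.getHost()
        + (port == -1 ? "" : ":" + port) + "/robots.txt";
  }

  // Item 2: collect Disallow rules per User-agent record. If a record names
  // our agent, only its rules are returned; otherwise the "*" rules apply.
  static List<String> parseDisallows(Reader in, String agent) throws IOException {
    List<String> wildcard = new ArrayList<>();
    List<String> specific = new ArrayList<>();
    boolean sawSpecific = false, inWildcard = false, inSpecific = false;
    boolean lastWasAgent = false;
    BufferedReader r = new BufferedReader(in);
    String line;
    while ((line = r.readLine()) != null) {
      int hash = line.indexOf('#');
      if (hash >= 0) line = line.substring(0, hash); // strip trailing comment
      line = line.trim();
      int colon = line.indexOf(':');
      if (colon < 0) continue; // blank or malformed line
      String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
      String value = line.substring(colon + 1).trim();
      if (field.equals("user-agent")) {
        // A User-agent line that follows a rule line starts a new record.
        if (!lastWasAgent) { inWildcard = false; inSpecific = false; }
        if (value.equals("*")) inWildcard = true;
        if (value.equalsIgnoreCase(agent)) { inSpecific = true; sawSpecific = true; }
        lastWasAgent = true;
      } else {
        lastWasAgent = false;
        // An empty Disallow value means "nothing disallowed", so skip it.
        if (field.equals("disallow") && !value.isEmpty()) {
          if (inSpecific) specific.add(value);
          if (inWildcard) wildcard.add(value);
        }
      }
    }
    return sawSpecific ? specific : wildcard;
  }

  // Item 3: send the crawler's User-Agent instead of the default "Java/<version>".
  // Assumes an http(s) URL; no network traffic happens until the caller connects.
  static HttpURLConnection openAsCrawler(URL url) throws IOException {
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setRequestProperty("User-Agent", AGENT);
    return conn;
  }
}
```

With this sketch, a page at http://localhost:8983/docs/index.html would be checked against http://localhost:8983/robots.txt instead of the :80 default, and the same openAsCrawler() helper could serve both the page fetch and the robots.txt fetch.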
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)