Posted to dev@lucene.apache.org by "Ryan Stokes (JIRA)" <ji...@apache.org> on 2018/02/23 17:54:00 UTC

[jira] [Created] (SOLR-12026) SimplePostTool with robots.txt

Ryan Stokes created SOLR-12026:
----------------------------------

             Summary: SimplePostTool with robots.txt
                 Key: SOLR-12026
                 URL: https://issues.apache.org/jira/browse/SOLR-12026
             Project: Solr
          Issue Type: Bug
      Security Level: Public (Default Security Level. Issues are Public)
          Components: SimplePostTool
    Affects Versions: 7.2
            Reporter: Ryan Stokes


[First issue here, apologies in advance for missteps.]

Three things which could improve working with robots.txt:
 # When fetching the corresponding robots.txt for a URL, the port is ignored, so the request goes to the default port (:80).  If nothing is listening on :80, the page is fetched anyway.  isDisallowedByRobots() could include url.getPort() when constructing strRobot.  This helps when testing your robots.txt on a non-standard port, such as during development.
 # Disallow directives are applied regardless of User-agent.  parseRobotsTxt() could let a section that names SimplePostTool-crawler override a general Disallow.  This would help when indexing your own site, which you've explicitly allowed SimplePostTool to index.  I don't know if that's good practice, but it would help in testing.
 # The User-agent header sent when fetching robots.txt is not "SimplePostTool-crawler" but "Java/<version>".  The code in readPageFromUrl() that sets the header correctly could be reused in isDisallowedByRobots().
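
The three suggestions could be sketched roughly as follows. This is a standalone illustration, not a patch against SimplePostTool: the class RobotsSketch and the methods robotsUrlFor, openAsCrawler and disallowsFor are hypothetical names, and the grouping rule used here (a section naming the crawler replaces the "*" section) is an assumption about the desired behaviour.

```java
import java.io.IOException;
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;
import java.net.URLConnection;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper for illustration only; not part of SimplePostTool. */
class RobotsSketch {
  static final String AGENT = "SimplePostTool-crawler";

  /** Item 1: keep an explicit port when building the robots.txt URL. */
  static URL robotsUrlFor(URL page) throws MalformedURLException {
    // getPort() returns -1 when the URL carries no explicit port.
    String port = page.getPort() == -1 ? "" : ":" + page.getPort();
    return URI.create(page.getProtocol() + "://" + page.getHost() + port + "/robots.txt").toURL();
  }

  /** Item 3: send the crawler's User-Agent instead of the JVM default. */
  static URLConnection openAsCrawler(URL url) throws IOException {
    URLConnection conn = url.openConnection();
    conn.setRequestProperty("User-Agent", AGENT);
    return conn;
  }

  /**
   * Item 2: collect Disallow paths per User-agent section.
   * Assumption: if any section names the agent, only its rules apply;
   * otherwise the "*" section's rules apply.
   */
  static List<String> disallowsFor(String robotsTxt, String agent) {
    List<String> specific = new ArrayList<>();
    List<String> wildcard = new ArrayList<>();
    boolean forUs = false, forAll = false, sawUs = false, lastWasAgent = false;
    for (String line : robotsTxt.split("\\r?\\n")) {
      int hash = line.indexOf('#');           // drop trailing comments
      if (hash >= 0) line = line.substring(0, hash);
      int colon = line.indexOf(':');
      if (colon < 0) continue;                // blank or malformed line
      String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
      String value = line.substring(colon + 1).trim();
      if (field.equals("user-agent")) {
        // Consecutive User-agent lines share one section; otherwise start a new one.
        if (!lastWasAgent) { forUs = false; forAll = false; }
        if (value.equalsIgnoreCase(agent)) { forUs = true; sawUs = true; }
        else if (value.equals("*")) forAll = true;
        lastWasAgent = true;
      } else {
        lastWasAgent = false;
        if (field.equals("disallow") && !value.isEmpty()) {
          if (forUs) specific.add(value);
          if (forAll) wildcard.add(value);
        }
      }
    }
    return sawUs ? specific : wildcard;
  }
}
```

With something like this, a robots.txt that disallows "/" for "*" but has a separate section for SimplePostTool-crawler would block other crawlers while leaving the crawler's own section in charge, and a page on a development port such as :8983 would have its robots.txt looked up on that same port.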



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
