Posted to dev@manifoldcf.apache.org by "Karl Wright (JIRA)" <ji...@apache.org> on 2017/02/27 21:35:45 UTC

[jira] [Commented] (CONNECTORS-1392) Add option for Web connector to ignore robots instructions in meta tags and rel attributes

    [ https://issues.apache.org/jira/browse/CONNECTORS-1392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15886565#comment-15886565 ] 

Karl Wright commented on CONNECTORS-1392:
-----------------------------------------

Hi [~schuch], I think it is likely that people who are breaking the rules will break some of them but not *all* of them.  The meta and rel rules are currently hardwired because a crawler really shouldn't be "clicking" links that act as "execution" buttons of any kind in a UI.

There's also the problem that you will *absolutely* need to maintain backwards compatibility.  If you fold this change of functionality together with the robots processing, there is no way to do that.  So I encourage you to make separate controls/switches for *each* rule you want to be able to break.
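
To illustrate what separate switches could look like, here is a minimal sketch. All names in it (RobotsPolicy, Rule, mayFollowLink) are hypothetical and are not taken from the ManifoldCF code base; the only assumption is that each rule gets its own switch and that the default obeys every rule, so existing jobs keep their current behavior.

```java
import java.util.EnumSet;

/**
 * Hypothetical sketch, not ManifoldCF's actual API: one independent
 * switch per robots rule, with defaults that obey every rule.
 */
final class RobotsPolicy {
    /** The rules that can be switched off individually (illustrative names). */
    enum Rule { ROBOTS_TXT, META_ROBOTS, REL_NOFOLLOW }

    private final EnumSet<Rule> obeyed;

    private RobotsPolicy(EnumSet<Rule> obeyed) {
        this.obeyed = obeyed;
    }

    /** Default policy: obey everything, which preserves the existing behavior. */
    static RobotsPolicy defaults() {
        return new RobotsPolicy(EnumSet.allOf(Rule.class));
    }

    /** Returns a copy of this policy with exactly one rule switched off. */
    RobotsPolicy ignoring(Rule rule) {
        EnumSet<Rule> copy = EnumSet.copyOf(obeyed);
        copy.remove(rule);
        return new RobotsPolicy(copy);
    }

    boolean obeys(Rule rule) {
        return obeyed.contains(rule);
    }

    /**
     * Decides whether a link may be followed, given whether the page
     * carries a meta robots "nofollow" and whether the link itself
     * carries rel="nofollow". Each signal is only honored if its rule
     * is still switched on.
     */
    boolean mayFollowLink(boolean metaNofollow, boolean relNofollow) {
        if (metaNofollow && obeys(Rule.META_ROBOTS)) {
            return false;
        }
        if (relNofollow && obeys(Rule.REL_NOFOLLOW)) {
            return false;
        }
        return true;
    }
}
```

With this shape, RobotsPolicy.defaults().ignoring(RobotsPolicy.Rule.REL_NOFOLLOW) would follow rel="nofollow" links while still honoring meta robots tags, and a job that never sets any switch behaves as it does today.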


> Add option for Web connector to ignore robots instructions in meta tags and rel attributes
> ------------------------------------------------------------------------------------------
>
>                 Key: CONNECTORS-1392
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1392
>             Project: ManifoldCF
>          Issue Type: New Feature
>          Components: Web connector
>            Reporter: Markus Schuch
>
> The Web connector already has an option to ignore robots.txt.
> With this ticket, another option is added that allows the connector to ignore robots instructions in {{<meta name="robots ...}} tags and {{<a ... rel="nofollow" ...}} attributes.
> *First proposal (to be discussed)*
> Reuse the existing "Robots.txt usage" option in the "Robots" Tab. Rename the existing options:
> # Don't look at robots.txt, meta robots and rel attributes
> # Obey robots.txt, meta robots tags and rel attributes for data fetches only
> # Obey robots.txt, meta robots tags and rel attributes _(the default)_
> The end user doc needs to be updated.
> Google resources on robots instructions in HTML pages:
> [0] https://support.google.com/webmasters/answer/79812?hl=en&ctx=cb&src=cb&cbid=tnnsjq5jcodt&cbrank=4
> [1] https://support.google.com/webmasters/answer/96569?hl=en&ctx=cb&src=cb&cbid=-5rmggrfsp2rq&cbrank=3
> Thread on the mailing list:
> [2] https://www.mail-archive.com/user@manifoldcf.apache.org/msg03258.html



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)