Posted to dev@manifoldcf.apache.org by "Karl Wright (JIRA)" <ji...@apache.org> on 2010/09/08 12:17:34 UTC

[jira] Commented: (CONNECTORS-104) Make it easier to limit a web crawl to a single site

    [ https://issues.apache.org/jira/browse/CONNECTORS-104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12907154#action_12907154 ] 

Karl Wright commented on CONNECTORS-104:
----------------------------------------

Trying to limit to the seed domains automatically would, I think, cause more confusion than help.  I can, however, imagine introducing a checkbox on the "Inclusions" tab that, if checked, would limit the crawl to just the domains represented by the seeds, and even making it checked by default.  The implied regular expression would be:

^https?://<domain>([/?]|$)

for each seed, I believe.  (That's potentially a lot of regular expressions if the number of seeds is large, so in practice the logic obviously wouldn't use regexps.)
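
As a rough illustration of the non-regexp approach, here is a minimal Python sketch that collects the seed hosts into a set once and then tests each candidate URL with a single lookup. The function names seed_hosts and in_seed_domains are hypothetical, and this is not the web connector's actual code. It also compares host names only, so unlike the pattern above it ignores any port number.

```python
from urllib.parse import urlsplit


def seed_hosts(seeds):
    """Collect the host names of the seed URLs (hypothetical helper)."""
    hosts = set()
    for seed in seeds:
        # urlsplit().hostname is lower-cased and excludes any port.
        host = urlsplit(seed).hostname
        if host:
            hosts.add(host)
    return hosts


def in_seed_domains(url, hosts):
    """Return True if url is http(s) and its host is one of the seed hosts."""
    parts = urlsplit(url)
    return parts.scheme in ("http", "https") and parts.hostname in hosts


# Example seeds, chosen only for illustration.
hosts = seed_hosts(["http://example.com/", "https://docs.example.org/start"])
print(in_seed_domains("https://example.com/page?x=1", hosts))
print(in_seed_domains("http://example.com.evil.net/", hosts))
```

The cost per URL is one parse and one set lookup, regardless of how many seeds there are, which is the point of avoiding one regular expression per seed.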


> Make it easier to limit a web crawl to a single site
> ----------------------------------------------------
>
>                 Key: CONNECTORS-104
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-104
>             Project: Apache Connectors Framework
>          Issue Type: Improvement
>          Components: Web connector
>            Reporter: Jack Krupansky
>            Priority: Minor
>
> Unless the user explicitly enters an include regex carefully, a web crawl can quickly get out of control and start crawling the entire web when all the user may really want is to crawl just a single web site or portion thereof. So, it would be preferable if either by default or with a simple button the crawl could be limited to the seed web site(s).
