Posted to dev@manifoldcf.apache.org by "Michael J. Kelleher (Created) (JIRA)" <ji...@apache.org> on 2011/12/07 15:34:39 UTC

[jira] [Created] (CONNECTORS-309) On Canonicalization Tab , Allow regex transforms to modify the URL's for a crawl

On Canonicalization Tab , Allow regex transforms to modify the URL's for a crawl
--------------------------------------------------------------------------------

                 Key: CONNECTORS-309
                 URL: https://issues.apache.org/jira/browse/CONNECTORS-309
             Project: ManifoldCF
          Issue Type: Improvement
          Components: Web connector
    Affects Versions: ManifoldCF 0.4
            Reporter: Michael J. Kelleher
            Priority: Minor


There is no "Component" for a Job, although canonicalization is part of the Job definition.

I would like to be able to use a regex to transform a URL (not necessarily including the hostname and port).  Specifically, I would like to use this to remove certain request parameters from the URL.
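
To make the request concrete, here is a minimal sketch of the kind of regex transform described above, written in Python purely for illustration. It is not ManifoldCF code; the function name strip_param_regex and the sample parameter name "sid" are assumptions invented for this example.

```python
import re


def strip_param_regex(url: str, name: str) -> str:
    """Remove every `name=value` pair from a URL's query string with a regex.

    Hypothetical helper for illustration only. It matches the literal
    parameter name, so it does not see percent-encoded spellings of it.
    """
    # Match the parameter only when it directly follows '?' or '&',
    # and consume its value plus one trailing '&' if present.
    pattern = re.compile(r"(?<=[?&])" + re.escape(name) + r"=[^&#]*&?")
    stripped = pattern.sub("", url)
    # Drop a separator left dangling at the end of the query string.
    return re.sub(r"[?&](?=#|$)", "", stripped)


print(strip_param_regex("/page?sid=123&id=5", "sid"))
print(strip_param_regex("/page?id=5&sid=123", "sid"))
```

The transform is applied to the path and query only, which matches the "not necessarily including the hostname and port" part of the request.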

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (CONNECTORS-309) On Canonicalization Tab , Allow regex transforms to modify the URL's for a crawl

Posted by "Karl Wright (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CONNECTORS-309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karl Wright updated CONNECTORS-309:
-----------------------------------

    Fix Version/s: ManifoldCF next
    

        

[jira] [Updated] (CONNECTORS-309) On Canonicalization Tab , Allow regex transforms to modify the URL's for a crawl

Posted by "Karl Wright (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CONNECTORS-309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karl Wright updated CONNECTORS-309:
-----------------------------------

    Fix Version/s:     (was: ManifoldCF 0.5)
                   ManifoldCF next
    

        

[jira] [Commented] (CONNECTORS-309) On Canonicalization Tab , Allow regex transforms to modify the URL's for a crawl

Posted by "Karl Wright (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CONNECTORS-309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13164438#comment-13164438 ] 

Karl Wright commented on CONNECTORS-309:
----------------------------------------

I'd view this as "custom canonicalization" functionality.  The use case Mr. Kelleher relayed to me was removing a parameter from the query string.  A general regexp is not a good tool for that because of issues related to URL encoding, so maybe we could structure it as parameter removal/addition instead.
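
As a rough illustration of the parameter-removal idea, the sketch below parses the query string, drops the named parameters after decoding, and reassembles the URL. It uses Python's standard urllib.parse module and is only an assumption about how such a feature could behave, not the ManifoldCF implementation; the function name remove_params and the sample URLs are hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def remove_params(url: str, names: set) -> str:
    """Return `url` without the query parameters listed in `names`.

    Hypothetical helper for illustration only. Names are compared after
    percent-decoding, so an encoded spelling such as "s%69d" is still
    recognised as "sid".
    """
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in names
    ]
    # urlencode re-encodes the remaining pairs, which may normalise their
    # spelling (for example, a space is written as '+').
    return urlunsplit(parts._replace(query=urlencode(kept)))


print(remove_params("http://example.com/page?id=5&s%69d=abc", {"sid"}))
print(remove_params("/page?sid=1", {"sid"}))
```

Working on decoded name/value pairs sidesteps the encoding issue: a literal regex for "sid=" would miss the "s%69d=abc" form, while the parsed comparison catches it.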

                

        

[jira] [Updated] (CONNECTORS-309) On Canonicalization Tab , Allow regex transforms to modify the URL's for a crawl

Posted by "Karl Wright (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CONNECTORS-309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karl Wright updated CONNECTORS-309:
-----------------------------------

    Fix Version/s:     (was: ManifoldCF next)
                   ManifoldCF 0.5
         Assignee: Karl Wright

As stated this looks straightforward and will probably fit in the 0.5 timeframe.
                