Posted to dev@nutch.apache.org by "Markus Jelsma (JIRA)" <ji...@apache.org> on 2011/06/23 17:55:47 UTC

[jira] [Issue Comment Edited] (NUTCH-1011) Normalize duplicate slashes in URL's

    [ https://issues.apache.org/jira/browse/NUTCH-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13053937#comment-13053937 ] 

Markus Jelsma edited comment on NUTCH-1011 at 6/23/11 3:55 PM:
---------------------------------------------------------------

The previous regex seems to eat the character preceding the slashes as well. Here's a new patch using the following expression:
{code}
(?<!:)/{2,}
{code}
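
To illustrate what the expression does, here is a minimal standalone sketch. It is not the attached patch: the class name DuplicateSlashDemo and the method normalize are hypothetical, and it assumes every matched run of slashes is replaced by a single slash. The negative lookbehind (?<!:) keeps a match from starting right after a colon, so the "//" following the scheme is left alone, and because the lookbehind consumes nothing, the character before the slashes is no longer eaten.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of Nutch or of the attached patch.
class DuplicateSlashDemo {

    // Expression from this comment: a run of two or more slashes
    // that is not directly preceded by a colon.
    private static final Pattern DUPLICATE_SLASHES =
            Pattern.compile("(?<!:)/{2,}");

    // Assumption: each matched run is collapsed to a single slash.
    static String normalize(String url) {
        return DUPLICATE_SLASHES.matcher(url).replaceAll("/");
    }

    public static void main(String[] args) {
        // Example URL taken from the issue description.
        String url =
                "http://cocoon.apache.org///////////////////////1.x/dynamic.html";
        System.out.println(normalize(url));
    }
}
```

With the URL from the issue description this should yield http://cocoon.apache.org/1.x/dynamic.html, and a URL without duplicate slashes should pass through unchanged.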

      was (Author: markus17):
    The previous regex seems to eat the character preceding the slashes as well. Here's a new patch using the following expression:
(?<!:)/{2,}
  
> Normalize duplicate slashes in URL's
> ------------------------------------
>
>                 Key: NUTCH-1011
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1011
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 1.4, 2.0
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>            Priority: Minor
>         Attachments: NUTCH-1011-all-2.patch, NUTCH-1011-all.patch
>
>
> Many websites produce faulty URLs with multiple slashes, e.g. http://cocoon.apache.org///////////////////////1.x/dynamic.html
> This can be really nasty if the number of slashes varies: many distinct URLs then point to the same page, and each one can generate new (unique) URLs to the same or other duplicate pages.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira