Posted to dev@nutch.apache.org by "Markus Jelsma (JIRA)" <ji...@apache.org> on 2011/06/23 15:47:47 UTC

[jira] [Created] (NUTCH-1011) Remove double slashes

Remove double slashes
---------------------

                 Key: NUTCH-1011
                 URL: https://issues.apache.org/jira/browse/NUTCH-1011
             Project: Nutch
          Issue Type: Improvement
    Affects Versions: 1.4, 2.0
            Reporter: Markus Jelsma
            Assignee: Markus Jelsma
            Priority: Minor


Many websites produce faulty URLs with multiple slashes, e.g. http://cocoon.apache.org///////////////////////1.x/dynamic.html
This can be really nasty if the number of slashes varies: many URLs then point to the same page, and crawling them generates new (unique) URLs for the same or other duplicate pages.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Closed] (NUTCH-1011) Normalize duplicate slashes in URL's

Posted by "Markus Jelsma (Closed) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma closed NUTCH-1011.
--------------------------------


Bulk close of resolved issues of 1.4. bulkclose-1.4-20111220
                
> Normalize duplicate slashes in URL's
> ------------------------------------
>
>                 Key: NUTCH-1011
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1011
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 1.4, nutchgora
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>            Priority: Minor
>             Fix For: 1.4, nutchgora
>
>         Attachments: NUTCH-1011-1.4-2.patch, NUTCH-1011-all-3.patch
>
>
> Many websites produce faulty URLs with multiple slashes, e.g. http://cocoon.apache.org///////////////////////1.x/dynamic.html
> This can be really nasty if the number of slashes varies: many URLs then point to the same page, and crawling them generates new (unique) URLs for the same or other duplicate pages.


        

[jira] [Commented] (NUTCH-1011) Normalize duplicate slashes in URL's

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13060084#comment-13060084 ] 

Markus Jelsma commented on NUTCH-1011:
--------------------------------------

Most likely not, but I will look into it and make sure it is.



        

[jira] [Commented] (NUTCH-1011) Normalize duplicate slashes in URL's

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13060637#comment-13060637 ] 

Julien Nioche commented on NUTCH-1011:
--------------------------------------

great. +1 to commit



        

[jira] [Updated] (NUTCH-1011) Normalize duplicate slashes in URL's

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1011:
---------------------------------

    Fix Version/s: 2.0
                   1.4



        

[jira] [Resolved] (NUTCH-1011) Normalize duplicate slashes in URL's

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma resolved NUTCH-1011.
----------------------------------

    Resolution: Fixed

Committed for 1.4 in rev. 1143467 and for trunk in rev. 1143468. Thanks, Julien, for the reminder about the test.



        

[jira] [Updated] (NUTCH-1011) Normalize duplicate slashes in URL's

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1011:
---------------------------------

    Attachment: NUTCH-1011-all-2.patch

The previous regex seems to eat the character preceding the slashes as well. Here's a new patch using the following expression:
(?<!:)/{2,}
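
To illustrate what the negative lookbehind buys, here is a minimal sketch in Python. Python's `re` module is only a stand-in for whichever regex engine the normalizer actually uses, and the function name `normalize_slashes` is hypothetical. The lookbehind `(?<!:)` checks the character before the slashes without consuming it, so the preceding character survives and the `://` of a scheme is left alone.

```python
import re

# Expression quoted in the comment above: two or more slashes that are
# not immediately preceded by a colon. The lookbehind is zero-width, so
# nothing before the slashes is consumed by the match.
DUPLICATE_SLASHES = re.compile(r"(?<!:)/{2,}")


def normalize_slashes(url: str) -> str:
    """Collapse runs of slashes into one, leaving '://' untouched (hypothetical helper)."""
    return DUPLICATE_SLASHES.sub("/", url)


if __name__ == "__main__":
    print(normalize_slashes("http://cocoon.apache.org///////1.x/dynamic.html"))
    # An embedded URL in the query string keeps its own '://'.
    print(normalize_slashes("http://example.org?url=http://example.org/"))
```

With these inputs the first URL collapses to a single slash after the host and the second is returned unchanged.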



        

[jira] [Updated] (NUTCH-1011) Normalize duplicate slashes in URL's

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1011:
---------------------------------

    Attachment: NUTCH-1011-all-3.patch

HTML entities must be escaped properly!



        

[jira] [Updated] (NUTCH-1011) Normalize duplicate slashes in URL's

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1011:
---------------------------------

    Attachment:     (was: NUTCH-1011-all.patch)



        

[jira] [Commented] (NUTCH-1011) Normalize duplicate slashes in URL's

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13059458#comment-13059458 ] 

Markus Jelsma commented on NUTCH-1011:
--------------------------------------

With NUTCH-1013 resolved, is this patch eligible for inclusion?



        

[jira] [Updated] (NUTCH-1011) Normalize duplicate slashes in URL's

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1011:
---------------------------------

    Summary: Normalize duplicate slashes in URL's  (was: Remove double slashes)



        

[jira] [Commented] (NUTCH-1011) Normalize duplicate slashes in URL's

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13054438#comment-13054438 ] 

Markus Jelsma commented on NUTCH-1011:
--------------------------------------

This normalizer works with NUTCH-1013.
 
{code}
<!-- removes duplicate slashes -->
<regex>
  <pattern>(?&lt;!:)/{2,}</pattern>
  <substitution>/</substitution>
</regex>
{code}
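
The `&lt;` in the pattern above is there because a literal `<` is not allowed inside XML element text. As a rough sketch (Python's standard `xml.etree.ElementTree` and `re` are stand-ins for the real configuration loader and regex engine, and `load_rule` is a hypothetical helper), parsing the rule turns the entity back into the intended expression, while the unescaped form is rejected by the XML parser before any regex is compiled:

```python
import re
import xml.etree.ElementTree as ET

# The rule as it appears in the comment above, with '<' escaped as '&lt;'.
ESCAPED_RULE = """
<regex>
  <pattern>(?&lt;!:)/{2,}</pattern>
  <substitution>/</substitution>
</regex>
"""

# The same rule with a raw '<' in the pattern, which is not well-formed XML.
UNESCAPED_RULE = ESCAPED_RULE.replace("&lt;", "<")


def load_rule(xml_text: str) -> tuple[re.Pattern, str]:
    """Parse one <regex> element into a compiled pattern and its substitution (hypothetical helper)."""
    rule = ET.fromstring(xml_text)
    return re.compile(rule.findtext("pattern")), rule.findtext("substitution")


if __name__ == "__main__":
    pattern, substitution = load_rule(ESCAPED_RULE)
    print(pattern.pattern)  # the entity has been decoded back to '<'
    print(pattern.sub(substitution, "http://example.org//a///b"))
    try:
        load_rule(UNESCAPED_RULE)
    except ET.ParseError as err:
        print("unescaped rule rejected:", err)
```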



        

[jira] [Updated] (NUTCH-1011) Normalize duplicate slashes in URL's

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1011:
---------------------------------

    Attachment:     (was: NUTCH-1011-all-2.patch)



        

[jira] [Updated] (NUTCH-1011) Normalize duplicate slashes in URL's

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1011:
---------------------------------

    Attachment: NUTCH-1011-1.4-2.patch

The patch now includes a unit test, which passes! Everything is happy again!



        

[jira] [Updated] (NUTCH-1011) Remove double slashes

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1011:
---------------------------------

    Attachment: NUTCH-1011-all.patch

Added a new expression to detect double slashes that are _not_ part of the scheme, where the scheme separator is identified as two slashes preceded by a colon:

<regex>
  <pattern>[^:\/\/]\/{2,}</pattern>
  <substitution>/</substitution>
</regex>

This keeps schemes embedded in the URL intact, e.g. http://example.org?url=http://example.org/
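
A small sketch of how this first expression behaves, using Python's `re` module as a stand-in engine (the name `normalize_first_attempt` is hypothetical). Inside the brackets the repeated `\/` adds nothing, so the class simply means "one character that is neither `:` nor `/`". That character is part of the match, which is why a later comment on this issue reports that the expression eats the character preceding the slashes.

```python
import re

# First attempt from the message above: one character that is neither
# ':' nor '/', followed by two or more slashes. The leading character
# is consumed by the match and therefore replaced along with the slashes.
FIRST_ATTEMPT = re.compile(r"[^:\/\/]\/{2,}")


def normalize_first_attempt(url: str) -> str:
    """Apply the first-attempt rule with '/' as the substitution (hypothetical helper)."""
    return FIRST_ATTEMPT.sub("/", url)


if __name__ == "__main__":
    # The 'g' of 'org' disappears together with the extra slashes.
    print(normalize_first_attempt("http://cocoon.apache.org///////1.x/dynamic.html"))
    # Embedded schemes are indeed left intact, as the message says.
    print(normalize_first_attempt("http://example.org?url=http://example.org/"))
```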




        

[jira] [Commented] (NUTCH-1011) Normalize duplicate slashes in URL's

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13060073#comment-13060073 ] 

Julien Nioche commented on NUTCH-1011:
--------------------------------------

Is this case covered by the tests in org.apache.nutch.net.urlnormalizer.regex.TestRegexURLNormalizer?




        

[jira] [Issue Comment Edited] (NUTCH-1011) Normalize duplicate slashes in URL's

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13053937#comment-13053937 ] 

Markus Jelsma edited comment on NUTCH-1011 at 6/23/11 3:55 PM:
---------------------------------------------------------------

The previous regex seems to eat the character preceding the slashes as well. Here's a new patch using the following expression:
{code}
(?<!:)/{2,}
{code}

      was (Author: markus17):
    The previous regex seems to eat the character preceding the slashes as well. Here's a new patch using the following expression:
(?<!:)/{2,}
  


        

[jira] [Commented] (NUTCH-1011) Normalize duplicate slashes in URL's

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13061041#comment-13061041 ] 

Hudson commented on NUTCH-1011:
-------------------------------

Integrated in Nutch-trunk #1538 (See [https://builds.apache.org/job/Nutch-trunk/1538/])
    NUTCH-1011 Remove duplicate slashes from URLs

markus : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1143468
Files : 
* /nutch/trunk/src/test/org/apache/nutch/net/TestURLNormalizers.java
* /nutch/trunk/conf/regex-normalize.xml.template
* /nutch/trunk/CHANGES.txt




        

[jira] [Commented] (NUTCH-1011) Normalize duplicate slashes in URL's

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13053953#comment-13053953 ] 

Markus Jelsma commented on NUTCH-1011:
--------------------------------------

Oh, it gets better. It seems the regex engine in use cannot deal with my expression:

regex.RegexURLNormalizer - error parsing conf file: org.apache.oro.text.regex.MalformedPatternException: Sequence (?<...) not recognized
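
For comparison only, and not the approach taken in this issue: an engine without lookbehind support could in principle use a capturing group that matches the preceding character and puts it back in the substitution. The sketch below uses Python's `re` as a stand-in; the names are hypothetical, and whether the ORO engine would accept this exact pattern and back-reference syntax is an assumption that has not been checked here.

```python
import re

# Lookbehind-free variant (an assumption, not taken from the patch):
# capture the character before the slashes and re-emit it via \1.
NO_LOOKBEHIND = re.compile(r"([^:/])/{2,}")


def normalize_without_lookbehind(url: str) -> str:
    """Collapse duplicate slashes while restoring the captured leading character (hypothetical helper)."""
    return NO_LOOKBEHIND.sub(r"\1/", url)


if __name__ == "__main__":
    print(normalize_without_lookbehind("http://cocoon.apache.org///////1.x/dynamic.html"))
    print(normalize_without_lookbehind("http://example.org?url=http://example.org/"))
    # Unlike the lookbehind form, a run of slashes at the very start of
    # the string is not touched, because there is no character to capture.
    print(normalize_without_lookbehind("///path"))
```

The trade-off is that the group needs a real character to capture, so this variant and the lookbehind expression disagree on inputs that begin with slashes.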

