Posted to dev@nutch.apache.org by "Markus Jelsma (JIRA)" <ji...@apache.org> on 2011/06/23 15:47:47 UTC
[jira] [Created] (NUTCH-1011) Remove double slashes
Remove double slashes
---------------------
Key: NUTCH-1011
URL: https://issues.apache.org/jira/browse/NUTCH-1011
Project: Nutch
Issue Type: Improvement
Affects Versions: 1.4, 2.0
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Minor
Many websites produce faulty URLs with multiple slashes, e.g. http://cocoon.apache.org///////////////////////1.x/dynamic.html
This is especially harmful when the number of slashes varies: many distinct URLs then point to the same page, and each of them generates new (unique) URLs to the same or other duplicate pages.
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Closed] (NUTCH-1011) Normalize duplicate slashes in URL's
Posted by "Markus Jelsma (Closed) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Markus Jelsma closed NUTCH-1011.
--------------------------------
Bulk close of resolved issues of 1.4. bulkclose-1.4-20111220
> Normalize duplicate slashes in URL's
> ------------------------------------
>
> Key: NUTCH-1011
> URL: https://issues.apache.org/jira/browse/NUTCH-1011
> Project: Nutch
> Issue Type: Improvement
> Affects Versions: 1.4, nutchgora
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Priority: Minor
> Fix For: 1.4, nutchgora
>
> Attachments: NUTCH-1011-1.4-2.patch, NUTCH-1011-all-3.patch
>
>
> Many websites produce faulty URL's with multiple slashes e.g. http://cocoon.apache.org///////////////////////1.x/dynamic.html
> This can be really nasty if the number of slashes varies, resulting in many URL's actually pointing to the same page and generating new (unique) URL's to the same or other duplicate pages.
[jira] [Commented] (NUTCH-1011) Normalize duplicate slashes in URL's
Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13060084#comment-13060084 ]
Markus Jelsma commented on NUTCH-1011:
--------------------------------------
Most likely not, but I will look into it and make sure it is.
> Normalize duplicate slashes in URL's
> ------------------------------------
>
> Key: NUTCH-1011
> URL: https://issues.apache.org/jira/browse/NUTCH-1011
> Project: Nutch
> Issue Type: Improvement
> Affects Versions: 1.4, 2.0
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Priority: Minor
> Fix For: 1.4, 2.0
>
> Attachments: NUTCH-1011-all-3.patch
>
>
> Many websites produce faulty URL's with multiple slashes e.g. http://cocoon.apache.org///////////////////////1.x/dynamic.html
> This can be really nasty if the number of slashes varies, resulting in many URL's actually pointing to the same page and generating new (unique) URL's to the same or other duplicate pages.
[jira] [Commented] (NUTCH-1011) Normalize duplicate slashes in URL's
Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13060637#comment-13060637 ]
Julien Nioche commented on NUTCH-1011:
--------------------------------------
Great. +1 to commit.
> Normalize duplicate slashes in URL's
> ------------------------------------
>
> Key: NUTCH-1011
> URL: https://issues.apache.org/jira/browse/NUTCH-1011
> Project: Nutch
> Issue Type: Improvement
> Affects Versions: 1.4, 2.0
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Priority: Minor
> Fix For: 1.4, 2.0
>
> Attachments: NUTCH-1011-1.4-2.patch, NUTCH-1011-all-3.patch
>
>
> Many websites produce faulty URL's with multiple slashes e.g. http://cocoon.apache.org///////////////////////1.x/dynamic.html
> This can be really nasty if the number of slashes varies, resulting in many URL's actually pointing to the same page and generating new (unique) URL's to the same or other duplicate pages.
[jira] [Updated] (NUTCH-1011) Normalize duplicate slashes in URL's
Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Markus Jelsma updated NUTCH-1011:
---------------------------------
Fix Version/s: 2.0, 1.4
> Normalize duplicate slashes in URL's
> ------------------------------------
>
> Key: NUTCH-1011
> URL: https://issues.apache.org/jira/browse/NUTCH-1011
> Project: Nutch
> Issue Type: Improvement
> Affects Versions: 1.4, 2.0
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Priority: Minor
> Fix For: 1.4, 2.0
>
> Attachments: NUTCH-1011-all-3.patch
>
>
> Many websites produce faulty URL's with multiple slashes e.g. http://cocoon.apache.org///////////////////////1.x/dynamic.html
> This can be really nasty if the number of slashes varies, resulting in many URL's actually pointing to the same page and generating new (unique) URL's to the same or other duplicate pages.
[jira] [Resolved] (NUTCH-1011) Normalize duplicate slashes in URL's
Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Markus Jelsma resolved NUTCH-1011.
----------------------------------
Resolution: Fixed
Committed for 1.4 in rev. 1143467 and for trunk in rev. 1143468. Thanks, Julien, for the reminder about the test.
> Normalize duplicate slashes in URL's
> ------------------------------------
>
> Key: NUTCH-1011
> URL: https://issues.apache.org/jira/browse/NUTCH-1011
> Project: Nutch
> Issue Type: Improvement
> Affects Versions: 1.4, 2.0
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Priority: Minor
> Fix For: 1.4, 2.0
>
> Attachments: NUTCH-1011-1.4-2.patch, NUTCH-1011-all-3.patch
>
>
> Many websites produce faulty URL's with multiple slashes e.g. http://cocoon.apache.org///////////////////////1.x/dynamic.html
> This can be really nasty if the number of slashes varies, resulting in many URL's actually pointing to the same page and generating new (unique) URL's to the same or other duplicate pages.
[jira] [Updated] (NUTCH-1011) Normalize duplicate slashes in URL's
Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Markus Jelsma updated NUTCH-1011:
---------------------------------
Attachment: NUTCH-1011-all-2.patch
The previous regex seems to eat the character preceding the slashes as well. Here's a new patch using the following expression:
(?<!:)/{2,}
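
The difference between the two expressions can be sketched with java.util.regex. This is an illustration only, not code from the patch: the class and method names are made up for the example, and it assumes both expressions are applied with a plain "/" substitution.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of Nutch.
class SlashRegexComparison {
    // First attempt: the character class consumes one character in front of the slashes,
    // so that character is replaced together with them.
    static final Pattern FIRST = Pattern.compile("[^:\\/\\/]\\/{2,}");

    // Revised expression: the negative lookbehind only asserts "not preceded by ':'"
    // and consumes nothing.
    static final Pattern REVISED = Pattern.compile("(?<!:)/{2,}");

    static String withFirst(String url) {
        return FIRST.matcher(url).replaceAll("/");
    }

    static String withRevised(String url) {
        return REVISED.matcher(url).replaceAll("/");
    }

    public static void main(String[] args) {
        String url = "http://example.org///a//b.html";
        System.out.println(withFirst(url));   // http://example.or//b.html
        System.out.println(withRevised(url)); // http://example.org/a/b.html
    }
}
```

With the first expression the "g" of "org" and the "a" path segment disappear, which is the eaten character described above.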
> Normalize duplicate slashes in URL's
> ------------------------------------
>
> Key: NUTCH-1011
> URL: https://issues.apache.org/jira/browse/NUTCH-1011
> Project: Nutch
> Issue Type: Improvement
> Affects Versions: 1.4, 2.0
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Priority: Minor
> Attachments: NUTCH-1011-all-2.patch, NUTCH-1011-all.patch
>
>
> Many websites produce faulty URL's with multiple slashes e.g. http://cocoon.apache.org///////////////////////1.x/dynamic.html
> This can be really nasty if the number of slashes varies, resulting in many URL's actually pointing to the same page and generating new (unique) URL's to the same or other duplicate pages.
[jira] [Updated] (NUTCH-1011) Normalize duplicate slashes in URL's
Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Markus Jelsma updated NUTCH-1011:
---------------------------------
Attachment: NUTCH-1011-all-3.patch
HTML entities must be escaped properly!
> Normalize duplicate slashes in URL's
> ------------------------------------
>
> Key: NUTCH-1011
> URL: https://issues.apache.org/jira/browse/NUTCH-1011
> Project: Nutch
> Issue Type: Improvement
> Affects Versions: 1.4, 2.0
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Priority: Minor
> Attachments: NUTCH-1011-all-3.patch
>
>
> Many websites produce faulty URL's with multiple slashes e.g. http://cocoon.apache.org///////////////////////1.x/dynamic.html
> This can be really nasty if the number of slashes varies, resulting in many URL's actually pointing to the same page and generating new (unique) URL's to the same or other duplicate pages.
[jira] [Updated] (NUTCH-1011) Normalize duplicate slashes in URL's
Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Markus Jelsma updated NUTCH-1011:
---------------------------------
Attachment: (was: NUTCH-1011-all.patch)
> Normalize duplicate slashes in URL's
> ------------------------------------
>
> Key: NUTCH-1011
> URL: https://issues.apache.org/jira/browse/NUTCH-1011
> Project: Nutch
> Issue Type: Improvement
> Affects Versions: 1.4, 2.0
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Priority: Minor
>
> Many websites produce faulty URL's with multiple slashes e.g. http://cocoon.apache.org///////////////////////1.x/dynamic.html
> This can be really nasty if the number of slashes varies, resulting in many URL's actually pointing to the same page and generating new (unique) URL's to the same or other duplicate pages.
[jira] [Commented] (NUTCH-1011) Normalize duplicate slashes in URL's
Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13059458#comment-13059458 ]
Markus Jelsma commented on NUTCH-1011:
--------------------------------------
With NUTCH-1013 resolved, is this patch eligible for inclusion?
> Normalize duplicate slashes in URL's
> ------------------------------------
>
> Key: NUTCH-1011
> URL: https://issues.apache.org/jira/browse/NUTCH-1011
> Project: Nutch
> Issue Type: Improvement
> Affects Versions: 1.4, 2.0
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Priority: Minor
> Fix For: 1.4, 2.0
>
> Attachments: NUTCH-1011-all-3.patch
>
>
> Many websites produce faulty URL's with multiple slashes e.g. http://cocoon.apache.org///////////////////////1.x/dynamic.html
> This can be really nasty if the number of slashes varies, resulting in many URL's actually pointing to the same page and generating new (unique) URL's to the same or other duplicate pages.
[jira] [Updated] (NUTCH-1011) Normalize duplicate slashes in URL's
Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Markus Jelsma updated NUTCH-1011:
---------------------------------
Summary: Normalize duplicate slashes in URL's (was: Remove double slashes)
> Normalize duplicate slashes in URL's
> ------------------------------------
>
> Key: NUTCH-1011
> URL: https://issues.apache.org/jira/browse/NUTCH-1011
> Project: Nutch
> Issue Type: Improvement
> Affects Versions: 1.4, 2.0
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Priority: Minor
> Attachments: NUTCH-1011-all.patch
>
>
> Many websites produce faulty URL's with multiple slashes e.g. http://cocoon.apache.org///////////////////////1.x/dynamic.html
> This can be really nasty if the number of slashes varies, resulting in many URL's actually pointing to the same page and generating new (unique) URL's to the same or other duplicate pages.
[jira] [Commented] (NUTCH-1011) Normalize duplicate slashes in URL's
Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13054438#comment-13054438 ]
Markus Jelsma commented on NUTCH-1011:
--------------------------------------
This normalizer works with NUTCH-1013.
{code}
<!-- removes duplicate slashes -->
<regex>
  <pattern>(?&lt;!:)/{2,}</pattern>
  <substitution>/</substitution>
</regex>
{code}
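
As a rough illustration of what this rule does to crawl data, the sketch below applies the same expression with java.util.regex. It is not the Nutch normalizer itself: the class and method names are hypothetical, and the shorter cocoon.apache.org variants are made up for the example. Note that inside an XML file the "<" of the lookbehind has to be written as "&lt;", whereas a Java string literal takes it as is.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of Nutch.
class DuplicateSlashDemo {
    // Two or more slashes that are not directly preceded by a colon.
    private static final Pattern DUPLICATE_SLASHES = Pattern.compile("(?<!:)/{2,}");

    static String normalize(String url) {
        return DUPLICATE_SLASHES.matcher(url).replaceAll("/");
    }

    // Normalizes every URL and keeps only the distinct results, in input order.
    static Set<String> distinctAfterNormalizing(List<String> urls) {
        Set<String> distinct = new LinkedHashSet<>();
        for (String url : urls) {
            distinct.add(normalize(url));
        }
        return distinct;
    }

    public static void main(String[] args) {
        List<String> variants = Arrays.asList(
                "http://cocoon.apache.org//1.x/dynamic.html",
                "http://cocoon.apache.org/////1.x/dynamic.html",
                "http://cocoon.apache.org/1.x/dynamic.html");
        // All three variants collapse to one URL.
        System.out.println(distinctAfterNormalizing(variants));
        // A scheme embedded in the query string is left alone.
        System.out.println(normalize("http://example.org?url=http://example.org/"));
    }
}
```

Variants that differ only in the number of slashes end up as a single entry, while the "://" of both the outer and the embedded URL is kept.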
> Normalize duplicate slashes in URL's
> ------------------------------------
>
> Key: NUTCH-1011
> URL: https://issues.apache.org/jira/browse/NUTCH-1011
> Project: Nutch
> Issue Type: Improvement
> Affects Versions: 1.4, 2.0
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Priority: Minor
> Attachments: NUTCH-1011-all-3.patch
>
>
> Many websites produce faulty URL's with multiple slashes e.g. http://cocoon.apache.org///////////////////////1.x/dynamic.html
> This can be really nasty if the number of slashes varies, resulting in many URL's actually pointing to the same page and generating new (unique) URL's to the same or other duplicate pages.
[jira] [Updated] (NUTCH-1011) Normalize duplicate slashes in URL's
Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Markus Jelsma updated NUTCH-1011:
---------------------------------
Attachment: (was: NUTCH-1011-all-2.patch)
> Normalize duplicate slashes in URL's
> ------------------------------------
>
> Key: NUTCH-1011
> URL: https://issues.apache.org/jira/browse/NUTCH-1011
> Project: Nutch
> Issue Type: Improvement
> Affects Versions: 1.4, 2.0
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Priority: Minor
>
> Many websites produce faulty URL's with multiple slashes e.g. http://cocoon.apache.org///////////////////////1.x/dynamic.html
> This can be really nasty if the number of slashes varies, resulting in many URL's actually pointing to the same page and generating new (unique) URL's to the same or other duplicate pages.
[jira] [Updated] (NUTCH-1011) Normalize duplicate slashes in URL's
Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Markus Jelsma updated NUTCH-1011:
---------------------------------
Attachment: NUTCH-1011-1.4-2.patch
The patch now includes a unit test, which passes! Everything is happy again!
> Normalize duplicate slashes in URL's
> ------------------------------------
>
> Key: NUTCH-1011
> URL: https://issues.apache.org/jira/browse/NUTCH-1011
> Project: Nutch
> Issue Type: Improvement
> Affects Versions: 1.4, 2.0
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Priority: Minor
> Fix For: 1.4, 2.0
>
> Attachments: NUTCH-1011-1.4-2.patch, NUTCH-1011-all-3.patch
>
>
> Many websites produce faulty URL's with multiple slashes e.g. http://cocoon.apache.org///////////////////////1.x/dynamic.html
> This can be really nasty if the number of slashes varies, resulting in many URL's actually pointing to the same page and generating new (unique) URL's to the same or other duplicate pages.
[jira] [Updated] (NUTCH-1011) Remove double slashes
Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Markus Jelsma updated NUTCH-1011:
---------------------------------
Attachment: NUTCH-1011-all.patch
Added a new expression to detect double slashes that are _not_ part of the scheme, where a scheme is identified as two slashes preceded by a colon:
<regex>
  <pattern>[^:\/\/]\/{2,}</pattern>
  <substitution>/</substitution>
</regex>
This keeps schemes embedded in the URL intact, e.g. http://example.org?url=http://example.org/
> Remove double slashes
> ---------------------
>
> Key: NUTCH-1011
> URL: https://issues.apache.org/jira/browse/NUTCH-1011
> Project: Nutch
> Issue Type: Improvement
> Affects Versions: 1.4, 2.0
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Priority: Minor
> Attachments: NUTCH-1011-all.patch
>
>
> Many websites produce faulty URL's with multiple slashes e.g. http://cocoon.apache.org///////////////////////1.x/dynamic.html
> This can be really nasty if the number of slashes varies, resulting in many URL's actually pointing to the same page and generating new (unique) URL's to the same or other duplicate pages.
[jira] [Commented] (NUTCH-1011) Normalize duplicate slashes in URL's
Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13060073#comment-13060073 ]
Julien Nioche commented on NUTCH-1011:
--------------------------------------
Is this case covered by the tests in org.apache.nutch.net.urlnormalizer.regex.TestRegexURLNormalizer?
> Normalize duplicate slashes in URL's
> ------------------------------------
>
> Key: NUTCH-1011
> URL: https://issues.apache.org/jira/browse/NUTCH-1011
> Project: Nutch
> Issue Type: Improvement
> Affects Versions: 1.4, 2.0
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Priority: Minor
> Fix For: 1.4, 2.0
>
> Attachments: NUTCH-1011-all-3.patch
>
>
> Many websites produce faulty URL's with multiple slashes e.g. http://cocoon.apache.org///////////////////////1.x/dynamic.html
> This can be really nasty if the number of slashes varies, resulting in many URL's actually pointing to the same page and generating new (unique) URL's to the same or other duplicate pages.
[jira] [Issue Comment Edited] (NUTCH-1011) Normalize duplicate slashes in URL's
Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13053937#comment-13053937 ]
Markus Jelsma edited comment on NUTCH-1011 at 6/23/11 3:55 PM:
---------------------------------------------------------------
The previous regex seems to eat the character preceding the slashes as well. Here's a new patch using the following expression:
{code}
(?<!:)/{2,}
{code}
was (Author: markus17):
The previous regex seems to eat the character preceding the slashes as well. Here's a new patch using the following expression:
(?<!:)/{2,}
> Normalize duplicate slashes in URL's
> ------------------------------------
>
> Key: NUTCH-1011
> URL: https://issues.apache.org/jira/browse/NUTCH-1011
> Project: Nutch
> Issue Type: Improvement
> Affects Versions: 1.4, 2.0
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Priority: Minor
> Attachments: NUTCH-1011-all-2.patch, NUTCH-1011-all.patch
>
>
> Many websites produce faulty URL's with multiple slashes e.g. http://cocoon.apache.org///////////////////////1.x/dynamic.html
> This can be really nasty if the number of slashes varies, resulting in many URL's actually pointing to the same page and generating new (unique) URL's to the same or other duplicate pages.
[jira] [Commented] (NUTCH-1011) Normalize duplicate slashes in URL's
Posted by "Hudson (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13061041#comment-13061041 ]
Hudson commented on NUTCH-1011:
-------------------------------
Integrated in Nutch-trunk #1538 (See [https://builds.apache.org/job/Nutch-trunk/1538/])
NUTCH-1011 Remove duplicate slashes from URLs
markus : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1143468
Files :
* /nutch/trunk/src/test/org/apache/nutch/net/TestURLNormalizers.java
* /nutch/trunk/conf/regex-normalize.xml.template
* /nutch/trunk/CHANGES.txt
> Normalize duplicate slashes in URL's
> ------------------------------------
>
> Key: NUTCH-1011
> URL: https://issues.apache.org/jira/browse/NUTCH-1011
> Project: Nutch
> Issue Type: Improvement
> Affects Versions: 1.4, 2.0
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Priority: Minor
> Fix For: 1.4, 2.0
>
> Attachments: NUTCH-1011-1.4-2.patch, NUTCH-1011-all-3.patch
>
>
> Many websites produce faulty URL's with multiple slashes e.g. http://cocoon.apache.org///////////////////////1.x/dynamic.html
> This can be really nasty if the number of slashes varies, resulting in many URL's actually pointing to the same page and generating new (unique) URL's to the same or other duplicate pages.
[jira] [Commented] (NUTCH-1011) Normalize duplicate slashes in URL's
Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13053953#comment-13053953 ]
Markus Jelsma commented on NUTCH-1011:
--------------------------------------
Oh, it gets better. It seems the regex engine in use cannot handle my expression:
regex.RegexURLNormalizer - error parsing conf file: org.apache.oro.text.regex.MalformedPatternException: Sequence (?<...) not recognized
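
The error above comes from the ORO engine rejecting the lookbehind syntax. For comparison, java.util.regex does accept "(?<!...)". The sketch below is only an illustration of how an expression could be checked against that engine before it goes into a rule file; the class and method names are hypothetical and this is not how Nutch loads its configuration.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical helper; not part of Nutch.
class PatternCheck {
    // Returns true if java.util.regex can compile the expression, false otherwise.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            System.err.println("Rejected pattern: " + e.getDescription());
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(compiles("(?<!:)/{2,}")); // true: negative lookbehind is supported
        System.out.println(compiles("(unclosed"));   // false: unclosed group
    }
}
```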
> Normalize duplicate slashes in URL's
> ------------------------------------
>
> Key: NUTCH-1011
> URL: https://issues.apache.org/jira/browse/NUTCH-1011
> Project: Nutch
> Issue Type: Improvement
> Affects Versions: 1.4, 2.0
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Priority: Minor
> Attachments: NUTCH-1011-all-3.patch
>
>
> Many websites produce faulty URL's with multiple slashes e.g. http://cocoon.apache.org///////////////////////1.x/dynamic.html
> This can be really nasty if the number of slashes varies, resulting in many URL's actually pointing to the same page and generating new (unique) URL's to the same or other duplicate pages.