Posted to dev@nutch.apache.org by "Markus Jelsma (JIRA)" <ji...@apache.org> on 2011/07/19 14:11:57 UTC

[jira] [Created] (NUTCH-1062) Migrate BasicURLNormalizer from Apache ORO to java.util.regex

Migrate BasicURLNormalizer from Apache ORO to java.util.regex
-------------------------------------------------------------

                 Key: NUTCH-1062
                 URL: https://issues.apache.org/jira/browse/NUTCH-1062
             Project: Nutch
          Issue Type: Improvement
            Reporter: Markus Jelsma
            Assignee: Markus Jelsma
             Fix For: 1.4, 2.0


This issue tracks the migration of BasicURLNormalizer from ORO to j.u.regex. There is a small problem here. I began the migration mostly because of the double-slash issue: fixing it needs lookbehind, which is not supported in ORO, to prevent the "//" after the URL scheme from being reduced to a single slash. The current BasicURLNormalizer has this problem built in!

{code}
        // this pattern tries to find spots like "xx//yy" in the url,
        // which could be replaced by a "/"
        adjacentSlashRule = new Rule();
        adjacentSlashRule.pattern = (Perl5Pattern)      
          compiler.compile("/{2,}", Perl5Compiler.READ_ONLY_MASK);     
        adjacentSlashRule.substitution = new Perl5Substitution("/");
{code}

But this rule gives the wrong result, because it collapses the double slash after the scheme as well. What to do? Migrate to j.u.regex and keep this `feature` intact?
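
For illustration, here is a minimal sketch of how the rule might look in j.u.regex with a negative lookbehind. The class and method names are made up for this example and are not part of Nutch; the idea is only that "(?<!:)" skips a run of slashes that directly follows a colon, so the "//" after the scheme survives.

```java
import java.util.regex.Pattern;

// Hypothetical sketch, not the actual BasicURLNormalizer code.
class AdjacentSlashSketch {

    // Same pattern as the ORO rule: any run of two or more slashes.
    private static final Pattern ALL_ADJACENT = Pattern.compile("/{2,}");

    // Assumed replacement: the negative lookbehind rejects a run of
    // slashes that starts right after ':', i.e. the "//" after the scheme.
    private static final Pattern ADJACENT_NOT_AFTER_COLON =
        Pattern.compile("(?<!:)/{2,}");

    // Behaviour of the current rule when applied to a whole URL string.
    static String collapseAll(String url) {
        return ALL_ADJACENT.matcher(url).replaceAll("/");
    }

    // Behaviour with the lookbehind in place.
    static String collapseKeepingScheme(String url) {
        return ADJACENT_NOT_AFTER_COLON.matcher(url).replaceAll("/");
    }

    public static void main(String[] args) {
        String url = "http://example.com//a///b";
        System.out.println(collapseAll(url));           // http:/example.com/a/b
        System.out.println(collapseKeepingScheme(url)); // http://example.com/a/b
    }
}
```

Note that the lookbehind only looks for a colon, so a "://" inside a query string is left alone too. Whether that is wanted is a separate question.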

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (NUTCH-1062) Migrate BasicURLNormalizer from Apache ORO to java.util.regex

Posted by "Markus Jelsma (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1062:
---------------------------------

    Fix Version/s:     (was: 1.4)
                       (was: 2.0)
                   1.5
    


        

[jira] [Updated] (NUTCH-1062) Migrate BasicURLNormalizer from Apache ORO to java.util.regex

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1062:
---------------------------------

    Description: 
This issue tracks the migration of BasicURLNormalizer from ORO to j.u.regex. There is a small problem here. I began the migration mostly because of the double-slash issue: fixing it needs lookbehind, which is not supported in ORO, to prevent the "//" after the URL scheme from being reduced to a single slash. The current BasicURLNormalizer has this problem built in!

{code}
        // this pattern tries to find spots like "xx//yy" in the url,
        // which could be replaced by a "/"
        adjacentSlashRule = new Rule();
        adjacentSlashRule.pattern = (Perl5Pattern)      
          compiler.compile("/{2,}", Perl5Compiler.READ_ONLY_MASK);     
        adjacentSlashRule.substitution = new Perl5Substitution("/");
{code}

But this rule gives the wrong result, because it collapses the double slash after the scheme as well. What to do? Migrate to j.u.regex and keep this `feature` intact?

edit: reading further, it looks like this is fixed at a later stage: a slash is added back for the URI schemes http & ftp.
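
A rough sketch of why the plain "/{2,}" rule can be harmless when the scheme separator is restored afterwards: if the pattern is applied to the path only and the URL is then reassembled, the "//" after the scheme is written back by the reassembly step. This uses java.net.URI purely as an assumption for the example; the class and method names are hypothetical and this is not how BasicURLNormalizer is actually written.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.regex.Pattern;

// Hypothetical sketch, not the actual Nutch implementation.
class PathOnlySlashSketch {

    // The unchanged rule: any run of two or more slashes.
    private static final Pattern ADJACENT_SLASHES = Pattern.compile("/{2,}");

    static String normalize(String urlString) {
        URI uri = URI.create(urlString);
        String path = uri.getPath();
        if (path == null) {
            return urlString; // opaque URI such as mailto:, nothing to do
        }
        // Collapse slashes in the path component only.
        String cleaned = ADJACENT_SLASHES.matcher(path).replaceAll("/");
        try {
            // Reassembly writes "scheme://authority" again, so the
            // separator after the scheme is never touched by the rule.
            return new URI(uri.getScheme(), uri.getAuthority(), cleaned,
                           uri.getQuery(), uri.getFragment()).toString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(normalize("http://example.com//a///b")); // http://example.com/a/b
    }
}
```

This sketch decodes and re-encodes the path and query on the way through, so percent-encoded characters may come out differently; a real normalizer would need to handle that.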




        

[jira] [Updated] (NUTCH-1062) Migrate BasicURLNormalizer from Apache ORO to java.util.regex

Posted by "Markus Jelsma (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1062:
---------------------------------

    Fix Version/s:     (was: 1.5)
                   1.6

20120304-push-1.6
                
