You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@nutch.apache.org by "Radim Kolar (JIRA)" <ji...@apache.org> on 2011/08/26 01:32:29 UTC

[jira] [Created] (NUTCH-1098) better url-normalizer basic

better url-normalizer basic
---------------------------

                 Key: NUTCH-1098
                 URL: https://issues.apache.org/jira/browse/NUTCH-1098
             Project: Nutch
          Issue Type: Improvement
          Components: fetcher
    Affects Versions: 1.3
         Environment: Any
            Reporter: Radim Kolar
             Fix For: 1.4
         Attachments: urlnormalizer.patch

Basic URL normalizer lacks 2 important features

Encode space in URL into %20 to unbreak httpclient and possibly others who do not expect space inside URL

Ability to decode %33 encoding in URL. This is important for avoiding duplicates

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (NUTCH-1098) better url-normalizer basic

Posted by "Markus Jelsma (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13140170#comment-13140170 ] 

Markus Jelsma commented on NUTCH-1098:
--------------------------------------

Path prefixes can be overcome by using -p1 with patch. Indenting will be looked after.

I also believe we should to disable this feature by default as existing crawls are very much affected by this issue and need to be taken care off. One would have to renormalize the crawldb (and linkdb and webgraphdb if in use) which is trivial. However, the keys in existing segments are not updated so indexing the data won't work anymore.
                
> better url-normalizer basic
> ---------------------------
>
>                 Key: NUTCH-1098
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1098
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>    Affects Versions: 1.3
>         Environment: Any
>            Reporter: Radim Kolar
>            Assignee: Markus Jelsma
>              Labels: encoding, url
>             Fix For: 1.5
>
>         Attachments: patch-urlnormalizer.diff
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> Basic URL normalizer lacks 2 important features
> Encode space in URL into %20 to unbreak httpclient and possibly others who do not expect space inside URL
> Ability to decode %33 encoding in URL. This is important for avoiding duplicates

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (NUTCH-1098) better url-normalizer basic

Posted by "Ferdy (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13140167#comment-13140167 ] 

Ferdy commented on NUTCH-1098:
------------------------------

+1 for encoding spaces into %20, both in file and query. (Although encoding spaces in the query part into `+` is also fine with me.)

+1 for uppercasing escaped chars. (%1f --> %1F)

I propose to split this issue into separate issues, as this makes discussing the individual features a lot easier. (Also making it easier to commit changes that have smaller overall impact such as the %20 feature).

Also, the latest patch still has some flaws. (wrong path as files are prefixed with 'a/ and some indent issues here and there.).
                
> better url-normalizer basic
> ---------------------------
>
>                 Key: NUTCH-1098
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1098
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>    Affects Versions: 1.3
>         Environment: Any
>            Reporter: Radim Kolar
>            Assignee: Markus Jelsma
>              Labels: encoding, url
>             Fix For: 1.5
>
>         Attachments: patch-urlnormalizer.diff
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> Basic URL normalizer lacks 2 important features
> Encode space in URL into %20 to unbreak httpclient and possibly others who do not expect space inside URL
> Ability to decode %33 encoding in URL. This is important for avoiding duplicates

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (NUTCH-1098) better url-normalizer basic

Posted by "Radim Kolar (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13144020#comment-13144020 ] 

Radim Kolar commented on NUTCH-1098:
------------------------------------

By removing my patch i also withdraw permission which i previously granted to ASF to distribute my copyrighted works.
                
> better url-normalizer basic
> ---------------------------
>
>                 Key: NUTCH-1098
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1098
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>    Affects Versions: 1.3
>         Environment: Any
>            Reporter: Radim Kolar
>            Assignee: Markus Jelsma
>              Labels: encoding, url
>             Fix For: 1.5
>
>         Attachments: patch-with-utf8-encoding.diff
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> Basic URL normalizer lacks 2 important features
> Encode space in URL into %20 to unbreak httpclient and possibly others who do not expect space inside URL
> Ability to decode %33 encoding in URL. This is important for avoiding duplicates

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (NUTCH-1098) better url-normalizer basic

Posted by "Radim Kolar (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13120912#comment-13120912 ] 

Radim Kolar commented on NUTCH-1098:
------------------------------------

1. Some servers sends spaces in URLs
2. Based on their web framework, if they use URL rewriting some characters are % encoded by it. For example ~ is very often encoded
                
> better url-normalizer basic
> ---------------------------
>
>                 Key: NUTCH-1098
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1098
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>    Affects Versions: 1.3
>         Environment: Any
>            Reporter: Radim Kolar
>            Assignee: Markus Jelsma
>              Labels: encoding, url
>             Fix For: 1.4
>
>         Attachments: nutch.diff
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> Basic URL normalizer lacks 2 important features
> Encode space in URL into %20 to unbreak httpclient and possibly others who do not expect space inside URL
> Ability to decode %33 encoding in URL. This is important for avoiding duplicates

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (NUTCH-1098) better url-normalizer basic

Posted by "Markus Jelsma (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13120886#comment-13120886 ] 

Markus Jelsma commented on NUTCH-1098:
--------------------------------------

What's the real life use-case for decoding? I've not yet seen servers returning URL's where that range for some reason is URL encoded.
                
> better url-normalizer basic
> ---------------------------
>
>                 Key: NUTCH-1098
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1098
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>    Affects Versions: 1.3
>         Environment: Any
>            Reporter: Radim Kolar
>            Assignee: Markus Jelsma
>              Labels: encoding, url
>             Fix For: 1.4
>
>         Attachments: nutch.diff
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> Basic URL normalizer lacks 2 important features
> Encode space in URL into %20 to unbreak httpclient and possibly others who do not expect space inside URL
> Ability to decode %33 encoding in URL. This is important for avoiding duplicates

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (NUTCH-1098) better url-normalizer basic

Posted by "Lewis John McGibbney (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13127434#comment-13127434 ] 

Lewis John McGibbney commented on NUTCH-1098:
---------------------------------------------

Does anyone have comments on this one? Thanks for including the test with this patch Radim. I will test and comment accordingly (hopefully) tonight. Have you considered Sebastians comments?
                
> better url-normalizer basic
> ---------------------------
>
>                 Key: NUTCH-1098
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1098
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>    Affects Versions: 1.3
>         Environment: Any
>            Reporter: Radim Kolar
>            Assignee: Markus Jelsma
>              Labels: encoding, url
>             Fix For: 1.4
>
>         Attachments: nutch.diff
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> Basic URL normalizer lacks 2 important features
> Encode space in URL into %20 to unbreak httpclient and possibly others who do not expect space inside URL
> Ability to decode %33 encoding in URL. This is important for avoiding duplicates

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (NUTCH-1098) better url-normalizer basic

Posted by "Ferdy Galema (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13144025#comment-13144025 ] 

Ferdy Galema commented on NUTCH-1098:
-------------------------------------

Radim, that's funny. I don't believe that is possible.

Also, aside from the fact that you could not deal with the feedback about the patch quality, I simply did not agree with the entire contents of the patch. There are some parts in it which I did not understand yet (or still don't). You failed to deal with this feedback too. So please do not fool yourself that this is simply about the patch quality.
                
> better url-normalizer basic
> ---------------------------
>
>                 Key: NUTCH-1098
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1098
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>    Affects Versions: 1.3
>         Environment: Any
>            Reporter: Radim Kolar
>            Assignee: Markus Jelsma
>              Labels: encoding, url
>             Fix For: 1.5
>
>         Attachments: patch-with-utf8-encoding.diff
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> Basic URL normalizer lacks 2 important features
> Encode space in URL into %20 to unbreak httpclient and possibly others who do not expect space inside URL
> Ability to decode %33 encoding in URL. This is important for avoiding duplicates

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (NUTCH-1098) better url-normalizer basic

Posted by "Radim Kolar (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Radim Kolar updated NUTCH-1098:
-------------------------------

    Attachment:     (was: nutch.diff)
    
> better url-normalizer basic
> ---------------------------
>
>                 Key: NUTCH-1098
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1098
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>    Affects Versions: 1.3
>         Environment: Any
>            Reporter: Radim Kolar
>            Assignee: Markus Jelsma
>              Labels: encoding, url
>             Fix For: 1.5
>
>         Attachments: patch-urlnormalizer.diff
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> Basic URL normalizer lacks 2 important features
> Encode space in URL into %20 to unbreak httpclient and possibly others who do not expect space inside URL
> Ability to decode %33 encoding in URL. This is important for avoiding duplicates

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (NUTCH-1098) better url-normalizer basic

Posted by "Radim Kolar (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Radim Kolar updated NUTCH-1098:
-------------------------------

    Attachment:     (was: patch-urlnormalizer.diff)
    
> better url-normalizer basic
> ---------------------------
>
>                 Key: NUTCH-1098
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1098
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>    Affects Versions: 1.3
>         Environment: Any
>            Reporter: Radim Kolar
>            Assignee: Markus Jelsma
>              Labels: encoding, url
>             Fix For: 1.5
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> Basic URL normalizer lacks 2 important features
> Encode space in URL into %20 to unbreak httpclient and possibly others who do not expect space inside URL
> Ability to decode %33 encoding in URL. This is important for avoiding duplicates

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Assigned] (NUTCH-1098) better url-normalizer basic

Posted by "Markus Jelsma (Assigned) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma reassigned NUTCH-1098:
------------------------------------

    Assignee: Markus Jelsma
    
> better url-normalizer basic
> ---------------------------
>
>                 Key: NUTCH-1098
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1098
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>    Affects Versions: 1.3
>         Environment: Any
>            Reporter: Radim Kolar
>            Assignee: Markus Jelsma
>              Labels: encoding, url
>             Fix For: 1.4
>
>         Attachments: nutch.diff
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> Basic URL normalizer lacks 2 important features
> Encode space in URL into %20 to unbreak httpclient and possibly others who do not expect space inside URL
> Ability to decode %33 encoding in URL. This is important for avoiding duplicates

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (NUTCH-1098) better url-normalizer basic

Posted by "Lewis John McGibbney (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13125429#comment-13125429 ] 

Lewis John McGibbney commented on NUTCH-1098:
---------------------------------------------

Hi Radim are you happy with this patch as is? Do you intend on adding in your latter comments?
                
> better url-normalizer basic
> ---------------------------
>
>                 Key: NUTCH-1098
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1098
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>    Affects Versions: 1.3
>         Environment: Any
>            Reporter: Radim Kolar
>            Assignee: Markus Jelsma
>              Labels: encoding, url
>             Fix For: 1.4
>
>         Attachments: nutch.diff
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> Basic URL normalizer lacks 2 important features
> Encode space in URL into %20 to unbreak httpclient and possibly others who do not expect space inside URL
> Ability to decode %33 encoding in URL. This is important for avoiding duplicates

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (NUTCH-1098) better url-normalizer basic

Posted by "Radim Kolar (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13142699#comment-13142699 ] 

Radim Kolar commented on NUTCH-1098:
------------------------------------

a/ Please direct your complains about quality of git generated patches to git mailing list. i am not going to generate patches for you manually by running diff -Naur

b/ if you used something better then SVN (hg,git,bzr) you can cherrypick changes from my branch, create new branches for every subtask and attaching branches to JIRA reports and then you can discuss them separately.

c/ more efficient is if i dont spend more 10x more time in pointless discussions then on coding
                
> better url-normalizer basic
> ---------------------------
>
>                 Key: NUTCH-1098
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1098
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>    Affects Versions: 1.3
>         Environment: Any
>            Reporter: Radim Kolar
>            Assignee: Markus Jelsma
>              Labels: encoding, url
>             Fix For: 1.5
>
>         Attachments: patch-urlnormalizer.diff
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> Basic URL normalizer lacks 2 important features
> Encode space in URL into %20 to unbreak httpclient and possibly others who do not expect space inside URL
> Ability to decode %33 encoding in URL. This is important for avoiding duplicates

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (NUTCH-1098) better url-normalizer basic

Posted by "Markus Jelsma (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1098:
---------------------------------

    Attachment: patch-with-utf8-encoding.diff

Restored original patch.
                
> better url-normalizer basic
> ---------------------------
>
>                 Key: NUTCH-1098
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1098
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>    Affects Versions: 1.3
>         Environment: Any
>            Reporter: Radim Kolar
>            Assignee: Markus Jelsma
>              Labels: encoding, url
>             Fix For: 1.5
>
>         Attachments: patch-with-utf8-encoding.diff
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> Basic URL normalizer lacks 2 important features
> Encode space in URL into %20 to unbreak httpclient and possibly others who do not expect space inside URL
> Ability to decode %33 encoding in URL. This is important for avoiding duplicates

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (NUTCH-1098) better url-normalizer basic

Posted by "Radim Kolar (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13127440#comment-13127440 ] 

Radim Kolar commented on NUTCH-1098:
------------------------------------

I did, but due to lack of time to test what current browsers do with spaces in query, i will leave it as it is.
                
> better url-normalizer basic
> ---------------------------
>
>                 Key: NUTCH-1098
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1098
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>    Affects Versions: 1.3
>         Environment: Any
>            Reporter: Radim Kolar
>            Assignee: Markus Jelsma
>              Labels: encoding, url
>             Fix For: 1.4
>
>         Attachments: nutch.diff
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> Basic URL normalizer lacks 2 important features
> Encode space in URL into %20 to unbreak httpclient and possibly others who do not expect space inside URL
> Ability to decode %33 encoding in URL. This is important for avoiding duplicates

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (NUTCH-1098) better url-normalizer basic

Posted by "Radim Kolar (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Radim Kolar updated NUTCH-1098:
-------------------------------

    Attachment: urlnormalizer.patch

Patch against branch-1.4

> better url-normalizer basic
> ---------------------------
>
>                 Key: NUTCH-1098
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1098
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>    Affects Versions: 1.3
>         Environment: Any
>            Reporter: Radim Kolar
>              Labels: encoding, url
>             Fix For: 1.4
>
>         Attachments: urlnormalizer.patch
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> Basic URL normalizer lacks 2 important features
> Encode space in URL into %20 to unbreak httpclient and possibly others who do not expect space inside URL
> Ability to decode %33 encoding in URL. This is important for avoiding duplicates

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (NUTCH-1098) better url-normalizer basic

Posted by "Radim Kolar (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13144160#comment-13144160 ] 

Radim Kolar commented on NUTCH-1098:
------------------------------------

If you are so clever and hard working then stop undeleting my patch and write better one yourself. I am licensing my work as Affero GPL v3 from now.

You simply need months to discuss trivial code change. Everybody here claims to be smart like TV and hard working like black men but look at your results: mere 13 trivial commits in October. Look at my results i have 2.1 billions files indexed in 4 months.

I reworked major portion of nutch and i dont want to spend years waiting if they and ever and when will be merged. I have hadoop 0.21 api, generator with plugable algorithm, fixed building with maven, database backend switched to cassandra and other stuff. For me is far better to just pull 20 yours patches per month from github and dont waste my time with you in pointless discussions like git vs svn diff format.
                
> better url-normalizer basic
> ---------------------------
>
>                 Key: NUTCH-1098
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1098
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>    Affects Versions: 1.3
>         Environment: Any
>            Reporter: Radim Kolar
>            Assignee: Markus Jelsma
>              Labels: encoding, url
>             Fix For: 1.5
>
>         Attachments: patch-with-utf8-encoding.diff
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> Basic URL normalizer lacks 2 important features
> Encode space in URL into %20 to unbreak httpclient and possibly others who do not expect space inside URL
> Ability to decode %33 encoding in URL. This is important for avoiding duplicates

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (NUTCH-1098) better url-normalizer basic

Posted by "Markus Jelsma (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13131635#comment-13131635 ] 

Markus Jelsma commented on NUTCH-1098:
--------------------------------------

I am happy with the encoding of space to %20 but i am still not sure if decoding is without problems, especially decoding %2F to / (slash) as both have a different meaning as slash denotes the path. I went through a part of our CrawlDB searching for %2F occurences and got a list. I tried several URL's alternating between %2F and /. There are at least a few URL's that return different content.

This is correct:
http://www.detelefoongids.nl/bg-l/18236834-B%2Fmak+Bedrijfsmakelaars/vermelding/

But this is incorrect:
http://www.detelefoongids.nl/bg-l/18236834-B/mak+Bedrijfsmakelaars/vermelding/

This also applies for decoding %23 to # (hash). By decoding this character valid URL's are seen having a URL fragment which is then normalized away. See these examples:

http://www.noordhollandsdagblad.nl/nieuws/stadstreek/denhelder/article9516101.ece/%23Nieuwedweep%3A-Ambassadeur-van-Den-Helder-zijn-we-allemaal
http://www.motorstek.nl/gebruiker/fido%20%2377

In the above examples the encoding of the hash is intented and correct. The content contains a hash but the URL does not contain a fragment but this would: http://www.motorstek.nl/gebruiker/fido%20%2377#part-of-page

Another example of bad decoding is : http://st-annaland.citysite.nl/nieuws/6276724/2816_Rosenthal+ontvangt+oppositieleider+Syri%26%23235%3B.html.

I propose not to decode characters that have special meaning in URI's and be very careful in tests. As this is a very crucial issue and the implementation is not sound yet i propose to push this to 1.5 as i'm not very comfortable by introducing a potential big issue just before 1.4 is to be released.
                
> better url-normalizer basic
> ---------------------------
>
>                 Key: NUTCH-1098
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1098
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>    Affects Versions: 1.3
>         Environment: Any
>            Reporter: Radim Kolar
>            Assignee: Markus Jelsma
>              Labels: encoding, url
>             Fix For: 1.4
>
>         Attachments: nutch.diff, patch-urlnormalizer.diff
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> Basic URL normalizer lacks 2 important features
> Encode space in URL into %20 to unbreak httpclient and possibly others who do not expect space inside URL
> Ability to decode %33 encoding in URL. This is important for avoiding duplicates

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (NUTCH-1098) better url-normalizer basic

Posted by "Radim Kolar (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Radim Kolar updated NUTCH-1098:
-------------------------------

    Attachment:     (was: urlnormalizer.patch)

> better url-normalizer basic
> ---------------------------
>
>                 Key: NUTCH-1098
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1098
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>    Affects Versions: 1.3
>         Environment: Any
>            Reporter: Radim Kolar
>              Labels: encoding, url
>             Fix For: 1.4
>
>         Attachments: nutch.diff
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> Basic URL normalizer lacks 2 important features
> Encode space in URL into %20 to unbreak httpclient and possibly others who do not expect space inside URL
> Ability to decode %33 encoding in URL. This is important for avoiding duplicates

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (NUTCH-1098) better url-normalizer basic

Posted by "Chris A. Mattmann (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated NUTCH-1098:
-------------------------------------

    Fix Version/s:     (was: 1.4)
                   1.5

- push 
                
> better url-normalizer basic
> ---------------------------
>
>                 Key: NUTCH-1098
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1098
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>    Affects Versions: 1.3
>         Environment: Any
>            Reporter: Radim Kolar
>            Assignee: Markus Jelsma
>              Labels: encoding, url
>             Fix For: 1.5
>
>         Attachments: nutch.diff, patch-urlnormalizer.diff
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> Basic URL normalizer lacks 2 important features
> Encode space in URL into %20 to unbreak httpclient and possibly others who do not expect space inside URL
> Ability to decode %33 encoding in URL. This is important for avoiding duplicates

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (NUTCH-1098) better url-normalizer basic

Posted by "Lewis John McGibbney (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13129850#comment-13129850 ] 

Lewis John McGibbney commented on NUTCH-1098:
---------------------------------------------

I think this makes sense, but yes number two above is important as the problem scales in parallel with respect to crawl job size => big problems w.r.t index maintenance issues. 

Radim, can you please update your patch as it does not apply cleanly to most recent trunk. Once this has been done we can make a decision. Thank you 
                
> better url-normalizer basic
> ---------------------------
>
>                 Key: NUTCH-1098
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1098
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>    Affects Versions: 1.3
>         Environment: Any
>            Reporter: Radim Kolar
>            Assignee: Markus Jelsma
>              Labels: encoding, url
>             Fix For: 1.4
>
>         Attachments: nutch.diff
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> Basic URL normalizer lacks 2 important features
> Encode space in URL into %20 to unbreak httpclient and possibly others who do not expect space inside URL
> Ability to decode %33 encoding in URL. This is important for avoiding duplicates

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (NUTCH-1098) better url-normalizer basic

Posted by "Radim Kolar (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13126594#comment-13126594 ] 

Radim Kolar commented on NUTCH-1098:
------------------------------------

Patch is good. i will add replace high bit chars with %82 in next one
                
> better url-normalizer basic
> ---------------------------
>
>                 Key: NUTCH-1098
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1098
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>    Affects Versions: 1.3
>         Environment: Any
>            Reporter: Radim Kolar
>            Assignee: Markus Jelsma
>              Labels: encoding, url
>             Fix For: 1.4
>
>         Attachments: nutch.diff
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> Basic URL normalizer lacks 2 important features
> Encode space in URL into %20 to unbreak httpclient and possibly others who do not expect space inside URL
> Ability to decode %33 encoding in URL. This is important for avoiding duplicates

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (NUTCH-1098) better url-normalizer basic

Posted by "Julien Nioche (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13143038#comment-13143038 ] 

Julien Nioche commented on NUTCH-1098:
--------------------------------------

@Radim

Sounds like "I am not going to" is your favourite phrase.

You certainly prefer GIT to SVN, but the fact is that Nutch uses the latter and contributions are expected to be generated with 'svn diff'. By going on with what the rest of the community do (vs imposing your ways to others) you will make it easier for people to discuss, review and commit your contributions.

http://wiki.apache.org/nutch/HowToContribute >> Creating a patch
http://www.apache.org/foundation/how-it-works.html  >> Philosophy

Thanks
                
> better url-normalizer basic
> ---------------------------
>
>                 Key: NUTCH-1098
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1098
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>    Affects Versions: 1.3
>         Environment: Any
>            Reporter: Radim Kolar
>            Assignee: Markus Jelsma
>              Labels: encoding, url
>             Fix For: 1.5
>
>         Attachments: patch-with-utf8-encoding.diff
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> Basic URL normalizer lacks 2 important features
> Encode space in URL into %20 to unbreak httpclient and possibly others who do not expect space inside URL
> Ability to decode %33 encoding in URL. This is important for avoiding duplicates

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (NUTCH-1098) better url-normalizer basic

Posted by "Radim Kolar (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13144183#comment-13144183 ] 

Radim Kolar commented on NUTCH-1098:
------------------------------------

Remove my patch from this ticket. I hold copyrights on it and i have right to decide what to do with my copyrighted work. I don't want it to be included.
                
> better url-normalizer basic
> ---------------------------
>
>                 Key: NUTCH-1098
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1098
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>    Affects Versions: 1.3
>         Environment: Any
>            Reporter: Radim Kolar
>            Assignee: Markus Jelsma
>              Labels: encoding, url
>             Fix For: 1.5
>
>         Attachments: patch-with-utf8-encoding.diff
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> Basic URL normalizer lacks 2 important features
> Encode space in URL into %20 to unbreak httpclient and possibly others who do not expect space inside URL
> Ability to decode %33 encoding in URL. This is important for avoiding duplicates

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Issue Comment Edited] (NUTCH-1098) better url-normalizer basic

Posted by "Ferdy (Issue Comment Edited) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13140167#comment-13140167 ] 

Ferdy edited comment on NUTCH-1098 at 10/31/11 1:49 PM:
--------------------------------------------------------

+1 for encoding spaces into %20, both in file and query. (Although encoding spaces in the query part into 'plus' is also fine with me.)

+1 for uppercasing escaped chars. (%1f --> %1F)

I propose to split this issue into separate issues, as this makes discussing the individual features a lot easier. (Also making it easier to commit changes that have smaller overall impact such as the %20 feature).

Also, the latest patch still has some flaws. (Wrong path as files are prefixed with 'a/' and some indent issues here and there.)
                
      was (Author: ferdy.g):
    +1 for encoding spaces into %20, both in file and query. (Although encoding spaces in the query part into `+` is also fine with me.)

+1 for uppercasing escaped chars. (%1f --> %1F)

I propose to split this issue into separate issues, as this makes discussing the individual features a lot easier. (Also making it easier to commit changes that have smaller overall impact such as the %20 feature).

Also, the latest patch still has some flaws. (wrong path as files are prefixed with 'a/ and some indent issues here and there.).
                  
> better url-normalizer basic
> ---------------------------
>
>                 Key: NUTCH-1098
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1098
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>    Affects Versions: 1.3
>         Environment: Any
>            Reporter: Radim Kolar
>            Assignee: Markus Jelsma
>              Labels: encoding, url
>             Fix For: 1.5
>
>         Attachments: patch-urlnormalizer.diff
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> Basic URL normalizer lacks 2 important features
> Encode space in URL into %20 to unbreak httpclient and possibly others who do not expect space inside URL
> Ability to decode %33 encoding in URL. This is important for avoiding duplicates

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (NUTCH-1098) better url-normalizer basic

Posted by "Lewis John McGibbney (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13131591#comment-13131591 ] 

Lewis John McGibbney commented on NUTCH-1098:
---------------------------------------------

Hi Markus, do you have any comments on Radim's latest patch?
                
> better url-normalizer basic
> ---------------------------
>
>                 Key: NUTCH-1098
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1098
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>    Affects Versions: 1.3
>         Environment: Any
>            Reporter: Radim Kolar
>            Assignee: Markus Jelsma
>              Labels: encoding, url
>             Fix For: 1.4
>
>         Attachments: nutch.diff, patch-urlnormalizer.diff
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> Basic URL normalizer lacks 2 important features
> Encode space in URL into %20 to unbreak httpclient and possibly others who do not expect space inside URL
> Ability to decode %33 encoding in URL. This is important for avoiding duplicates

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (NUTCH-1098) better url-normalizer basic

Posted by "Radim Kolar (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13120966#comment-13120966 ] 

Radim Kolar commented on NUTCH-1098:
------------------------------------

Actually it might be even better to add capability to encode high bit chars in normalizer using UTF-8
                
> better url-normalizer basic
> ---------------------------
>
>                 Key: NUTCH-1098
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1098
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>    Affects Versions: 1.3
>         Environment: Any
>            Reporter: Radim Kolar
>            Assignee: Markus Jelsma
>              Labels: encoding, url
>             Fix For: 1.4
>
>         Attachments: nutch.diff
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> Basic URL normalizer lacks 2 important features
> Encode space in URL into %20 to unbreak httpclient and possibly others who do not expect space inside URL
> Ability to decode %33 encoding in URL. This is important for avoiding duplicates

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (NUTCH-1098) better url-normalizer basic

Posted by "Sebastian Nagel (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13128372#comment-13128372 ] 

Sebastian Nagel commented on NUTCH-1098:
----------------------------------------

It depends on the used browser and where the query is entered (in the URL line or in a form field, cf. http://en.wikipedia.org/wiki/URL_encoding#The_application.2Fx-www-form-urlencoded_type). 
>From the user perspective it seems to be irrelevant how a query containing spaces is encoded. For both
 http://www.google.com/search?q=nutch+crawler
 http://www.google.com/search?q=nutch%20crawler
you get the same result page. Servlets are not sensitive to whether `+' or `%20' represents a space (at least, if standard decoding routines are used).

But despite the question what browsers are doing a URL normalizer should replace the space because:
 1. fetching URLs containing spaces may fail (it will with httpclient)
 2. avoid fetching the same document twice by two variant URLs

Since the first and more severe problem is fixed I opt for applying the patch "as is". But we should keep point 2 on the radar for a later improvement.
I think BasicURLNormalizer is the right place for a solution because a position-dependent replacement by `%20' or `+' before resp. after `?' can hardly be expressed by a single regular expression and hence cannot be done by adding a rule to RegexURLNormalizer.
                
> better url-normalizer basic
> ---------------------------
>
>                 Key: NUTCH-1098
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1098
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>    Affects Versions: 1.3
>         Environment: Any
>            Reporter: Radim Kolar
>            Assignee: Markus Jelsma
>              Labels: encoding, url
>             Fix For: 1.4
>
>         Attachments: nutch.diff
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> Basic URL normalizer lacks 2 important features
> Encode space in URL into %20 to unbreak httpclient and possibly others who do not expect space inside URL
> Ability to decode %33 encoding in URL. This is important for avoiding duplicates

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (NUTCH-1098) better url-normalizer basic

Posted by "Sebastian Nagel (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13125965#comment-13125965 ] 

Sebastian Nagel commented on NUTCH-1098:
----------------------------------------

Spaces in URLs are quite frequent and most browsers do the escaping inherentyl.
So +1 for escaping spaces per default.

A little problem: The patch replaces spaces by %20 in all positions but in the query part (after ?) a space is usually encoded as '+'.
                
> better url-normalizer basic
> ---------------------------
>
>                 Key: NUTCH-1098
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1098
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>    Affects Versions: 1.3
>         Environment: Any
>            Reporter: Radim Kolar
>            Assignee: Markus Jelsma
>              Labels: encoding, url
>             Fix For: 1.4
>
>         Attachments: nutch.diff
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> Basic URL normalizer lacks 2 important features
> Encode space in URL into %20 to unbreak httpclient and possibly others who do not expect space inside URL
> Ability to decode %33 encoding in URL. This is important for avoiding duplicates

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (NUTCH-1098) better url-normalizer basic

Posted by "Radim Kolar (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Radim Kolar updated NUTCH-1098:
-------------------------------

    Attachment: patch-urlnormalizer.diff

Do not decode # and / characters during %XX decoding. Unit tests added.
                
> better url-normalizer basic
> ---------------------------
>
>                 Key: NUTCH-1098
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1098
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>    Affects Versions: 1.3
>         Environment: Any
>            Reporter: Radim Kolar
>            Assignee: Markus Jelsma
>              Labels: encoding, url
>             Fix For: 1.5
>
>         Attachments: patch-urlnormalizer.diff
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> Basic URL normalizer lacks 2 important features
> Encode space in URL into %20 to unbreak httpclient and possibly others who do not expect space inside URL
> Ability to decode %33 encoding in URL. This is important for avoiding duplicates

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Closed] (NUTCH-1098) better url-normalizer basic

Posted by "Markus Jelsma (Closed) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma closed NUTCH-1098.
--------------------------------

    Assignee:     (was: Markus Jelsma)
    
> better url-normalizer basic
> ---------------------------
>
>                 Key: NUTCH-1098
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1098
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>    Affects Versions: 1.3
>         Environment: Any
>            Reporter: Radim Kolar
>              Labels: encoding, url
>             Fix For: 1.5
>
>         Attachments: patch-with-utf8-encoding.diff
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> Basic URL normalizer lacks 2 important features
> Encode space in URL into %20 to unbreak httpclient and possibly others who do not expect space inside URL
> Ability to decode %33 encoding in URL. This is important for avoiding duplicates

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (NUTCH-1098) better url-normalizer basic

Posted by "Ferdy Galema (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13144132#comment-13144132 ] 

Ferdy Galema commented on NUTCH-1098:
-------------------------------------

Like I said before, I'm up for converting spaces to %20, so that at least the fetcher will not fail. Also I think convert lowercase escapings to uppercase is a good idea. (Though I'm not complete sure if this is completely interchangable)

However I cannot commit this using the latest patch, because obviously it is intertwined with other changes. Why is the url decoded before changes spaces into %20 whereafter it is encoded again? Why "Remove % encoding from URL in range 0x20-0x80 exclusive / and # are not decoded" and why "// this pattern tries to find spots like "%34" in the url"?

I think changes to default filtering/normalizing should be very well thought out, because like Markus said the impact can be potentially very big. In short, I'm not able to commit anything as of now. (But if anyone else is, don't be stopped by me). Just my 2 cents.
                
> better url-normalizer basic
> ---------------------------
>
>                 Key: NUTCH-1098
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1098
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>    Affects Versions: 1.3
>         Environment: Any
>            Reporter: Radim Kolar
>            Assignee: Markus Jelsma
>              Labels: encoding, url
>             Fix For: 1.5
>
>         Attachments: patch-with-utf8-encoding.diff
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> Basic URL normalizer lacks 2 important features
> Encode space in URL into %20 to unbreak httpclient and possibly others who do not expect space inside URL
> Ability to decode %33 encoding in URL. This is important for avoiding duplicates

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Resolved] (NUTCH-1098) better url-normalizer basic

Posted by "Radim Kolar (Resolved) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Radim Kolar resolved NUTCH-1098.
--------------------------------

    Resolution: Invalid

Attached patch was in improper format.
                
> better url-normalizer basic
> ---------------------------
>
>                 Key: NUTCH-1098
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1098
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>    Affects Versions: 1.3
>         Environment: Any
>            Reporter: Radim Kolar
>            Assignee: Markus Jelsma
>              Labels: encoding, url
>             Fix For: 1.5
>
>         Attachments: patch-with-utf8-encoding.diff
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> Basic URL normalizer lacks 2 important features
> Encode space in URL into %20 to unbreak httpclient and possibly others who do not expect space inside URL
> Ability to decode %33 encoding in URL. This is important for avoiding duplicates

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (NUTCH-1098) better url-normalizer basic

Posted by "Ferdy Galema (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13142291#comment-13142291 ] 

Ferdy Galema commented on NUTCH-1098:
-------------------------------------

@Markus/Radim

I certainly do not want to nitpick about patches, but I think feedback about unnecessary changes or malformed patches should be given. Of course when applying the patch you could simply ignore or correct them, but in the end higher quality patches benefit all of us. It just makes the process of reviewing/editing/committing a lot easier.

@Radim

Do you agree that "better url-normalizer basic" is perhaps overly broad? I can probably think of tens of other improvements that fall under the scope of a better basic urlnormalizer. Discussing / managing them in separate issues is much more efficient than cramming them all into a single one.

Anyway this is not to undermine the effort of course. Keep up the good work! (And feel free to disagree)

Cheers!
                
> better url-normalizer basic
> ---------------------------
>
>                 Key: NUTCH-1098
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1098
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>    Affects Versions: 1.3
>         Environment: Any
>            Reporter: Radim Kolar
>            Assignee: Markus Jelsma
>              Labels: encoding, url
>             Fix For: 1.5
>
>         Attachments: patch-urlnormalizer.diff
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> Basic URL normalizer lacks 2 important features
> Encode space in URL into %20 to unbreak httpclient and possibly others who do not expect space inside URL
> Ability to decode %33 encoding in URL. This is important for avoiding duplicates

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (NUTCH-1098) better url-normalizer basic

Posted by "Chris A. Mattmann (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13144168#comment-13144168 ] 

Chris A. Mattmann commented on NUTCH-1098:
------------------------------------------

{quote}
You simply need months to discuss trivial code change. Everybody here claims to be smart like TV and hard working like black men but look at your results: mere 13 trivial commits in October. Look at my results i have 2.1 billions files indexed in 4 months.
{quote}

This will _*NOT*_ be tolerated at Apache. Take it off list. Racial slurs, insults, etc. have no place here on the list.

                
> better url-normalizer basic
> ---------------------------
>
>                 Key: NUTCH-1098
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1098
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>    Affects Versions: 1.3
>         Environment: Any
>            Reporter: Radim Kolar
>            Assignee: Markus Jelsma
>              Labels: encoding, url
>             Fix For: 1.5
>
>         Attachments: patch-with-utf8-encoding.diff
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> Basic URL normalizer lacks 2 important features
> Encode space in URL into %20 to unbreak httpclient and possibly others who do not expect space inside URL
> Ability to decode %33 encoding in URL. This is important for avoiding duplicates

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (NUTCH-1098) better url-normalizer basic

Posted by "Radim Kolar (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Radim Kolar updated NUTCH-1098:
-------------------------------

    Attachment:     (was: patch-urlnormalizer.diff)
    
> better url-normalizer basic
> ---------------------------
>
>                 Key: NUTCH-1098
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1098
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>    Affects Versions: 1.3
>         Environment: Any
>            Reporter: Radim Kolar
>            Assignee: Markus Jelsma
>              Labels: encoding, url
>             Fix For: 1.5
>
>         Attachments: patch-urlnormalizer.diff
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> Basic URL normalizer lacks 2 important features
> Encode space in URL into %20 to unbreak httpclient and possibly others who do not expect space inside URL
> Ability to decode %33 encoding in URL. This is important for avoiding duplicates

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (NUTCH-1098) better url-normalizer basic

Posted by "Chris A. Mattmann (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13144586#comment-13144586 ] 

Chris A. Mattmann commented on NUTCH-1098:
------------------------------------------

I was told the following from an experienced ASF infra person:

{quote}
While a contribution under the AL2 cannot
legally be withdrawn, ASF policy is to abide by the wishes of the
contributor even when legally we have sufficient rights to ignore them.
{quote}

So, we will leave the patch in place for the record, but abide by your wishes
not to include it. 
                
> better url-normalizer basic
> ---------------------------
>
>                 Key: NUTCH-1098
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1098
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>    Affects Versions: 1.3
>         Environment: Any
>            Reporter: Radim Kolar
>            Assignee: Markus Jelsma
>              Labels: encoding, url
>             Fix For: 1.5
>
>         Attachments: patch-with-utf8-encoding.diff
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> Basic URL normalizer lacks 2 important features
> Encode space in URL into %20 to unbreak httpclient and possibly others who do not expect space inside URL
> Ability to decode %33 encoding in URL. This is important for avoiding duplicates

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (NUTCH-1098) better url-normalizer basic

Posted by "Radim Kolar (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13141233#comment-13141233 ] 

Radim Kolar commented on NUTCH-1098:
------------------------------------

1. patch --ignore-whitespace

2. only small portion of existing crawled segments will become invalid and data in crawled segments has short lifetime anyway - They are indexed and deleted

3. decoding %2f into / is good idea because ASP.NET generated redirects comes with / encoded as %2f. (this was initially included but removed due to Markus complain)

4. No, i am not going to split such trivial patch like this
                
> better url-normalizer basic
> ---------------------------
>
>                 Key: NUTCH-1098
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1098
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>    Affects Versions: 1.3
>         Environment: Any
>            Reporter: Radim Kolar
>            Assignee: Markus Jelsma
>              Labels: encoding, url
>             Fix For: 1.5
>
>         Attachments: patch-urlnormalizer.diff
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> Basic URL normalizer lacks 2 important features
> Encode space in URL into %20 to unbreak httpclient and possibly others who do not expect space inside URL
> Ability to decode %33 encoding in URL. This is important for avoiding duplicates

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (NUTCH-1098) better url-normalizer basic

Posted by "Chris A. Mattmann (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13144099#comment-13144099 ] 

Chris A. Mattmann commented on NUTCH-1098:
------------------------------------------

Guys: let's change the tone of this issue, OK?

Radim, thanks for your patch. Sorry that it didn't get applied or that folks tried to engage in feedback/discussion with you on it. I would encourage you to not get discouraged and I appreciate your effort in trying to contribute to the Apache Nutch project.

The committers are the ones that have to figure out how to maintain things and sometimes we get hung up on yes I'll agree less important issues. I'm going to recommend that everyone just table those at the moment and that we move forward here. 

Here are some concrete next steps:

1. Ferdy: is it possible to commit a portion of this patch that you do understand? Then we could leave the part that you don't uncommitted. This has 2 immediate goals:
  - gives Radim a good feeling for contributing to the project -- he deserves that.
  - gives us the ability to cherry pick what we understand and are willing to maintain

2. Radim: if you want to help in improving the formatting and other requested issues, great. If you don't then that's fine too. At that point though the maintenance/evolution of the patch will transition more into the Nutch folks and you might not be as involved with it unless you get on board with what the guys have decided are their code formatting and patch generation guidelines.

Thanks!




                
> better url-normalizer basic
> ---------------------------
>
>                 Key: NUTCH-1098
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1098
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>    Affects Versions: 1.3
>         Environment: Any
>            Reporter: Radim Kolar
>            Assignee: Markus Jelsma
>              Labels: encoding, url
>             Fix For: 1.5
>
>         Attachments: patch-with-utf8-encoding.diff
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> Basic URL normalizer lacks 2 important features
> Encode space in URL into %20 to unbreak httpclient and possibly others who do not expect space inside URL
> Ability to decode %33 encoding in URL. This is important for avoiding duplicates

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (NUTCH-1098) better url-normalizer basic

Posted by "Radim Kolar (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13128264#comment-13128264 ] 

Radim Kolar commented on NUTCH-1098:
------------------------------------

Browsers seems to send spaces in URL encoded like this:
GET /index.jsp?q=wha%20just%20happened HTTP/1.0
                
> better url-normalizer basic
> ---------------------------
>
>                 Key: NUTCH-1098
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1098
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>    Affects Versions: 1.3
>         Environment: Any
>            Reporter: Radim Kolar
>            Assignee: Markus Jelsma
>              Labels: encoding, url
>             Fix For: 1.4
>
>         Attachments: nutch.diff
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> Basic URL normalizer lacks 2 important features
> Encode space in URL into %20 to unbreak httpclient and possibly others who do not expect space inside URL
> Ability to decode %33 encoding in URL. This is important for avoiding duplicates

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (NUTCH-1098) better url-normalizer basic

Posted by "Radim Kolar (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Radim Kolar updated NUTCH-1098:
-------------------------------

    Attachment: patch-urlnormalizer.diff
    
> better url-normalizer basic
> ---------------------------
>
>                 Key: NUTCH-1098
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1098
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>    Affects Versions: 1.3
>         Environment: Any
>            Reporter: Radim Kolar
>            Assignee: Markus Jelsma
>              Labels: encoding, url
>             Fix For: 1.4
>
>         Attachments: nutch.diff, patch-urlnormalizer.diff
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> Basic URL normalizer lacks 2 important features
> Encode space in URL into %20 to unbreak httpclient and possibly others who do not expect space inside URL
> Ability to decode %33 encoding in URL. This is important for avoiding duplicates

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (NUTCH-1098) better url-normalizer basic

Posted by "Radim Kolar (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Radim Kolar updated NUTCH-1098:
-------------------------------

    Attachment: patch-with-utf8-encoding.diff

Added support for encoding string to UTF-8 and then URL %escaping it.
                
> better url-normalizer basic
> ---------------------------
>
>                 Key: NUTCH-1098
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1098
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>    Affects Versions: 1.3
>         Environment: Any
>            Reporter: Radim Kolar
>            Assignee: Markus Jelsma
>              Labels: encoding, url
>             Fix For: 1.5
>
>         Attachments: patch-with-utf8-encoding.diff
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> Basic URL normalizer lacks 2 important features
> Encode space in URL into %20 to unbreak httpclient and possibly others who do not expect space inside URL
> Ability to decode %33 encoding in URL. This is important for avoiding duplicates

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (NUTCH-1098) better url-normalizer basic

Posted by "Radim Kolar (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Radim Kolar updated NUTCH-1098:
-------------------------------

    Attachment: nutch.diff

Updated patch. It also normalizes unprintable % sequences to upper case. Like %1f -> %1F . upper case seems to be way more common.

> better url-normalizer basic
> ---------------------------
>
>                 Key: NUTCH-1098
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1098
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>    Affects Versions: 1.3
>         Environment: Any
>            Reporter: Radim Kolar
>              Labels: encoding, url
>             Fix For: 1.4
>
>         Attachments: nutch.diff
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> Basic URL normalizer lacks 2 important features
> Encode space in URL into %20 to unbreak httpclient and possibly others who do not expect space inside URL
> Ability to decode %33 encoding in URL. This is important for avoiding duplicates

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (NUTCH-1098) better url-normalizer basic

Posted by "Radim Kolar (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Radim Kolar updated NUTCH-1098:
-------------------------------

    Attachment:     (was: patch-with-utf8-encoding.diff)
    
> better url-normalizer basic
> ---------------------------
>
>                 Key: NUTCH-1098
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1098
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>    Affects Versions: 1.3
>         Environment: Any
>            Reporter: Radim Kolar
>            Assignee: Markus Jelsma
>              Labels: encoding, url
>             Fix For: 1.5
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> Basic URL normalizer lacks 2 important features
> Encode space in URL into %20 to unbreak httpclient and possibly others who do not expect space inside URL
> Ability to decode %33 encoding in URL. This is important for avoiding duplicates

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira