Posted to dev@nutch.apache.org by "Doğacan Güney (JIRA)" <ji...@apache.org> on 2007/08/02 14:08:53 UTC

[jira] Commented: (NUTCH-522) Use URLValidator in the Injector

    [ https://issues.apache.org/jira/browse/NUTCH-522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12517232 ] 

Doğacan Güney commented on NUTCH-522:
-------------------------------------

> I tried with protocol-http and protocol-httpclient; I got the same error when the URL contained a space.
> I'm afraid it didn't change anything.

Actually, it is good news :). This means we can update the URL pattern to exclude URLs that contain spaces.

> I think you're right about the order, the normalizer should come first.

Btw, this is already what we do in ParseOutputFormat. URLs are normalized in Outlink's constructor, then validated and filtered in ParseOutputFormat.

So, I am going to reverse validator/normalizer order in your patch and commit it soon.
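
To illustrate why the order matters, here is a minimal sketch. It does not use Nutch's actual normalizer or validator classes: normalize, isValid, and the regular expression below are hypothetical stand-ins, assuming a normalizer that only trims surrounding whitespace and a validator that rejects any URL containing whitespace.

```java
import java.util.Optional;
import java.util.regex.Pattern;

final class InjectOrderSketch {
    // Hypothetical pattern (not Nutch's): a scheme, "://", then one or more
    // non-whitespace characters. Any space anywhere makes the match fail.
    private static final Pattern NO_WHITESPACE_URL =
            Pattern.compile("^[a-zA-Z][a-zA-Z0-9+.-]*://\\S+$");

    // Hypothetical stand-in for a normalizer: only trims surrounding whitespace.
    static String normalize(String url) {
        return url.trim();
    }

    // Hypothetical stand-in for a validator: the whole string must match.
    static boolean isValid(String url) {
        return NO_WHITESPACE_URL.matcher(url).matches();
    }

    // Order proposed in this comment: normalize first, then validate.
    static Optional<String> normalizeThenValidate(String url) {
        String normalized = normalize(url);
        return isValid(normalized) ? Optional.of(normalized) : Optional.empty();
    }

    // Reverse order: validate the raw string, then normalize.
    static Optional<String> validateThenNormalize(String url) {
        return isValid(url) ? Optional.of(normalize(url)) : Optional.empty();
    }

    public static void main(String[] args) {
        String padded = "  http://example.com/page  ";   // fixable by normalizing
        String broken = "http://example.com/some page";  // space inside the URL

        System.out.println(normalizeThenValidate(padded).isPresent()); // true
        System.out.println(validateThenNormalize(padded).isPresent()); // false
        System.out.println(normalizeThenValidate(broken).isPresent()); // false
    }
}
```

With the normalizer first, a seed URL that is only padded with whitespace survives, while a URL with a space inside it is still rejected by the validator in either order.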

> Use URLValidator in the Injector
> --------------------------------
>
>                 Key: NUTCH-522
>                 URL: https://issues.apache.org/jira/browse/NUTCH-522
>             Project: Nutch
>          Issue Type: Improvement
>          Components: injector
>            Reporter: Emmanuel Joke
>            Assignee: Emmanuel Joke
>            Priority: Minor
>             Fix For: 1.0.0
>
>         Attachments: NUTCH-522.patch, NUTCH-522_v2.patch, NUTCH-522_v3.patch
>
>
> Same as NUTCH-505, we should use the UrlValidator to check URLs in the Injector.
