Posted to dev@nutch.apache.org by "Marc Brette (JIRA)" <ji...@apache.org> on 2007/08/23 16:56:30 UTC

[jira] Created: (NUTCH-546) file URL are filtered out by the crawler

file URL are filtered out by the crawler
----------------------------------------

                 Key: NUTCH-546
                 URL: https://issues.apache.org/jira/browse/NUTCH-546
             Project: Nutch
          Issue Type: Bug
          Components: fetcher
    Affects Versions: 1.0.0
         Environment: Windows XP
Nutch trunk from Monday, August 20th 2007
            Reporter: Marc Brette


I tried to index a file system using the file:/ protocol, which worked fine in version 0.9.
The file URLs are being filtered out and not fetched at all.

I investigated the code and saw that there are 2 issues:
1) One is with the class UrlValidator: when validating a URL, it checks the 'authority', a combination of host and port. As it is null for file URLs, the URL is rejected.
2) Once this check is removed, files whose names contain space characters (and maybe other characters that must be URL-encoded) are also filtered out. This may be because the file protocol plugin doesn't URL-encode space characters and/or because UrlValidator enforces the rule that such characters be encoded.
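
A minimal sketch of the first point, using the JDK's java.net.URI rather than Nutch's UrlValidator (the class name AuthorityDemo and the sample URLs are made up for illustration). It only shows that a file URL of this shape has no authority component, which is the value the check described above rejects:

```java
import java.net.URI;

final class AuthorityDemo {
    // Returns the authority component (host, optionally with port) of a URI,
    // or null when the URI has no authority at all.
    static String authorityOf(String uri) {
        return URI.create(uri).getAuthority();
    }

    public static void main(String[] args) {
        // An http URL carries an authority: host plus optional port.
        System.out.println(authorityOf("http://example.com:8080/index.html"));
        // A file URL written as file:/... has none, so the result is null.
        System.out.println(authorityOf("file:/C:/docs/report.txt"));
    }
}
```

Any validator that requires a non-null authority will therefore reject every URL of the second form, whatever the rest of the URL looks like.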

To work around these issues, I just turned off all UrlValidator checks, and it works fine.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (NUTCH-546) file URL are filtered out by the crawler

Posted by "Doğacan Güney (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12523502 ] 

Doğacan Güney commented on NUTCH-546:
-------------------------------------

Btw, I also realized that a url like "http://localhost/" is considered invalid (even after this patch), because UrlValidator assumes that the host part of the url should have a '.' in it (among other things). How severe is this? I really like the fact that we can filter out urls like the one I mentioned (the web is filled with such urls). However, this may be a big problem for people who want to crawl their own computers or their own domains, I guess...

[jira] Commented: (NUTCH-546) file URL are filtered out by the crawler

Posted by "Hudson (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12526667 ] 

Hudson commented on NUTCH-546:
------------------------------

Integrated in Nutch-Nightly #204 (See [http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/204/])

[jira] Commented: (NUTCH-546) file URL are filtered out by the crawler

Posted by "Marc Brette (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12522987 ] 

Marc Brette commented on NUTCH-546:
-----------------------------------

Thanks. I like the normalizer approach.

[jira] Updated: (NUTCH-546) file URL are filtered out by the crawler

Posted by "Marc Brette (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/NUTCH-546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Marc Brette updated NUTCH-546:
------------------------------

    Description: 
I tried to index a file system using the file:/ protocol, which worked fine in version 0.9.
The file URLs are being filtered out and not fetched at all.

I investigated the code and saw that there are 2 issues:
1) One is with the class UrlValidator: when validating a URL, it checks the 'authority', a combination of host and port. As it is null for file URLs, the URL is rejected.
2) Once this check is removed, files whose names contain space characters (and maybe other characters that must be URL-encoded) are also filtered out. This may be because the file protocol plugin doesn't URL-encode space characters and/or because UrlValidator enforces the rule that such characters be encoded.

To work around these issues, I just commented out the UrlValidator checks, and it works fine.

  was:
I tried to index a file system using the file:/ protocol, which worked fine in version 0.9.
The file URLs are being filtered out and not fetched at all.

I investigated the code and saw that there are 2 issues:
1) One is with the class UrlValidator: when validating a URL, it checks the 'authority', a combination of host and port. As it is null for file URLs, the URL is rejected.
2) Once this check is removed, files whose names contain space characters (and maybe other characters that must be URL-encoded) are also filtered out. This may be because the file protocol plugin doesn't URL-encode space characters and/or because UrlValidator enforces the rule that such characters be encoded.

To work around these issues, I just turned off all UrlValidator checks, and it works fine.


[jira] Assigned: (NUTCH-546) file URL are filtered out by the crawler

Posted by "Doğacan Güney (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/NUTCH-546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doğacan Güney reassigned NUTCH-546:
-----------------------------------

    Assignee: Doğacan Güney

[jira] Resolved: (NUTCH-546) file URL are filtered out by the crawler

Posted by "Doğacan Güney (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/NUTCH-546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doğacan Güney resolved NUTCH-546.
---------------------------------

       Resolution: Fixed
    Fix Version/s: 1.0.0

Committed in rev. 574346.

Note that UrlValidator is now a plugin (urlfilter-validator) and is not enabled by default.

[jira] Commented: (NUTCH-546) file URL are filtered out by the crawler

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12523578 ] 

Andrzej Bialecki  commented on NUTCH-546:
-----------------------------------------

+1 - I think it's the best solution so far. Regarding the use of the regex urlfilter: for large crawls it's often unsuitable because of the totally bizarre urls you may encounter in the wild, which can tie the regex engine in knots. Example: a URL that is full of control characters and is > 10kB long (it's a real case) ...

[jira] Commented: (NUTCH-546) file URL are filtered out by the crawler

Posted by "Doğacan Güney (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12523539 ] 

Doğacan Güney commented on NUTCH-546:
-------------------------------------

> Why don't we rely on the regexp urlfilter for removing such URLs? Is there a performance issue?

UrlValidator is more complex than a simple regex filter.

See the discussion at http://www.nabble.com/fetching-http%3A--www.variety.com-%3C-div%3E%3C-a%3E-tf3961692.html and NUTCH-505 for a bit of background on UrlValidator.

I wanted UrlValidator to be in nutch core because the idea looked very simple and elegant to me: a utility class with no configuration options that just sits there and eliminates invalid urls. Since it would eliminate urls nutch can't fetch anyway (at least that's what I thought back then), there was no point in having a configuration option to enable or disable it.

We can change UrlValidator to be a urlfilter plugin (say, urlfilter-validator) that is disabled by default. This way, people running whole-web crawls can benefit from it without affecting others.
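
A rough sketch of what "a filter that is only applied when enabled" means in practice. Everything here is hypothetical: the UrlFilter interface is a local stand-in, loosely modeled on the idea of a URL filter that returns the url to keep it or null to drop it, and java.net.URI parsing stands in for the real validation rules. It is not the Nutch plugin API or the urlfilter-validator code.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.List;

final class FilterChainSketch {
    // Local stand-in contract: return the url to keep it, null to drop it.
    interface UrlFilter {
        String filter(String url);
    }

    // Stand-in "validator": keeps only strings that java.net.URI can parse.
    static final UrlFilter VALIDATOR = url -> {
        try {
            new URI(url);
            return url;
        } catch (URISyntaxException e) {
            return null;
        }
    };

    // Runs the url through every configured filter; stops at the first rejection.
    static String applyAll(List<UrlFilter> filters, String url) {
        String current = url;
        for (UrlFilter f : filters) {
            if (current == null) {
                return null;
            }
            current = f.filter(current);
        }
        return current;
    }
}
```

With an empty filter list (validator not enabled), a url such as "file:/C:/my docs/a.txt" passes through untouched; with VALIDATOR in the list, the same url is dropped because of the unencoded space.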

[jira] Commented: (NUTCH-546) file URL are filtered out by the crawler

Posted by "Marc Brette (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12522941 ] 

Marc Brette commented on NUTCH-546:
-----------------------------------

Sounds good. The URL syntax is defined in RFC 1738: http://www.ietf.org/rfc/rfc1738.txt

I don't know the design decision behind UrlValidator, but why didn't you just instantiate the Java URL class?

As for the 'space' character issue, it is the responsibility of the protocol-file plugin to ensure correct encoding, but shouldn't we be more flexible?
Isn't it possible that we will encounter such issues with URLs in pages? (You may also claim that it would be friendlier not to require the administrator to encode this character in the configuration file.)

I didn't start working on a solution. Doğacan, thanks for looking into this.

[jira] Updated: (NUTCH-546) file URL are filtered out by the crawler

Posted by "Doğacan Güney (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/NUTCH-546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doğacan Güney updated NUTCH-546:
--------------------------------

    Attachment: NUTCH-546-validator-plugin_v1.patch

Here is a patch that removes UrlValidator code from nutch code and adds it as a plugin (urlfilter-validator).

[jira] Commented: (NUTCH-546) file URL are filtered out by the crawler

Posted by "Doğacan Güney (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12522852 ] 

Doğacan Güney commented on NUTCH-546:
-------------------------------------

This is true, I missed it when committing UrlValidator. I guess we can change UrlValidator to fully validate (that is, do the full authority check, etc.) only URLs with the schemes http, https and ftp (are there any others?) and automatically accept anything with a different scheme. This should be done before the ASCII pattern check, since file names can contain non-ASCII characters, so that check must be skipped for them as well.

So, it can go like this:

1) Make sure that the url has ":/" in it. This will be helpful in eliminating noise (I think all urls must have a scheme part).
2) Check if the url starts with http, https or ftp. If it doesn't start with any of them, return true (to indicate that the url is valid).
3) If the url starts with one of them, run the validation code.
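
The three steps above can be sketched roughly as follows. This is an illustration only, not the attached patch: the class and method names are made up, and strictCheck is a hypothetical placeholder for the real authority/path/ASCII validation, which is not reproduced here.

```java
import java.util.Locale;

final class SchemeAwareValidationSketch {
    // Hypothetical placeholder for the full validation code (step 3).
    // Here it only rejects spaces and angle brackets.
    static boolean strictCheck(String url) {
        return url.chars().noneMatch(c -> c == ' ' || c == '<' || c == '>');
    }

    static boolean isValid(String url) {
        // Step 1: anything without ":/" has no scheme part and is treated as noise.
        if (url == null || !url.contains(":/")) {
            return false;
        }
        // Step 2: only http, https and ftp urls go through the strict checks.
        String lower = url.toLowerCase(Locale.ROOT);
        boolean strictScheme = lower.startsWith("http:")
                || lower.startsWith("https:")
                || lower.startsWith("ftp:");
        if (!strictScheme) {
            return true; // e.g. file:/ urls are accepted as-is
        }
        // Step 3: run the validation code for the strict schemes.
        return strictCheck(url);
    }
}
```

Under these assumptions a url such as "file:/C:/docs/my file.txt" is accepted without further checks, while "http://www.example.com/a<div" is still rejected.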

(It would be nice if we had some way of running some sort of validation on file:/'s and other protocols but I don't know if there are rules for such protocols.)

Does this sound good? I will send a patch for it soon. (Or, are you already working on this, Marc?)

[jira] Commented: (NUTCH-546) file URL are filtered out by the crawler

Posted by "Marc Brette (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12523515 ] 

Marc Brette commented on NUTCH-546:
-----------------------------------

Why don't we rely on the regexp urlfilter for removing such URLs? Is there a performance issue?

[jira] Commented: (NUTCH-546) file URL are filtered out by the crawler

Posted by "Doğacan Güney (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12522944 ] 

Doğacan Güney commented on NUTCH-546:
-------------------------------------

> I don't know the design decision behind UrlValidator, but why didn't you just instantiate the Java URL class?

UrlValidator is taken from Apache's commons-validator package and ported over to nutch. We use UrlValidator because Java's URL class is not really sufficient for our needs. For example, Java's URL class does not throw a MalformedURLException for a url like "http://www.example.com/a<div" (nutch's parse-js plugin goes over javascript sources to extract urls and can sometimes extract urls such as these). Another example is urls with spaces in them. Currently, nutch can't fetch (and I believe that it shouldn't fetch) a url if it has a space in it, so url validation filters it out. However, note that url validation runs _after_ url normalization. Url normalization is a facility to work out various quirks in urls. So one can write a url normalizer that normalizes a space to its % form, which nutch will then fetch.
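
A small sketch of the behaviour described above, using only JDK classes. The class and method names are made up; java.net.URI is added purely as a stricter point of comparison and is not what UrlValidator does. encodeSpaces is a deliberately naive stand-in for a url normalizer, not Nutch's normalizer code.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;

final class UrlLeniencyDemo {
    // True if java.net.URL constructs without a MalformedURLException.
    // (This constructor is deprecated in recent JDKs but still available.)
    static boolean acceptedByUrlClass(String s) {
        try {
            new URL(s);
            return true;
        } catch (MalformedURLException e) {
            return false;
        }
    }

    // True if java.net.URI, which checks characters more strictly, parses the string.
    static boolean acceptedByUriClass(String s) {
        try {
            new URI(s);
            return true;
        } catch (URISyntaxException e) {
            return false;
        }
    }

    // Naive normalization step: percent-encode spaces only.
    static String encodeSpaces(String s) {
        return s.replace(" ", "%20");
    }
}
```

For "http://www.example.com/a<div" and "http://www.example.com/a b", acceptedByUrlClass returns true while acceptedByUriClass returns false; after encodeSpaces, the second url also passes the stricter check, which is the effect a normalizer running before validation would have.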

You may think of UrlValidation as a filter that eliminates invalid urls + anything nutch can't fetch. The mistake was that we only considered protocol-http and protocol-httpclient plugins (for deciding what nutch can and can't fetch) while porting UrlValidator.

I hope this explanation helps. Feel free to add comments if you have more questions or something doesn't make sense.

[jira] Commented: (NUTCH-546) file URL are filtered out by the crawler

Posted by "Doğacan Güney (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12524462 ] 

Doğacan Güney commented on NUTCH-546:
-------------------------------------

OK, then. I will send a patch that 'pluginifies' UrlValidator soon.

[jira] Commented: (NUTCH-546) file URL are filtered out by the crawler

Posted by "Hudson (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12526363 ] 

Hudson commented on NUTCH-546:
------------------------------

Integrated in Nutch-Nightly #203 (See [http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/203/])

[jira] Updated: (NUTCH-546) file URL are filtered out by the crawler

Posted by "Doğacan Güney (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/NUTCH-546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doğacan Güney updated NUTCH-546:
--------------------------------

    Attachment: NUTCH-546.patch

Here is a patch that implements the method described earlier. The patch also strips unnecessary whitespace from UrlValidator and adds a main method.
