Posted to dev@nutch.apache.org by Diaa Abdallah <di...@gmail.com> on 2014/04/25 11:53:24 UTC
Why are web urls not assumed to be http
Hi,
I tried injecting www.google.com into my crawldb without prepending
http:// to it.
It injected fine; however, when I ran generate on it, it gave the
following warning:
"Malformed URL: 'www.google.com', skipping (java.net.MalformedURLException:
no protocol: www.google.com)"
Why doesn't Nutch assume that web links beginning with www. use the
http protocol?
Thanks,
Diaa
Re: Why are web urls not assumed to be http
Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi Diaa,
> Why doesn't nutch assume that web links that have www. at the beginning are
> of the http protocol?
It would not be a big problem to do so. The URL normalizer provides scopes
(inject, fetch, etc.): you only have to point the property
"urlnormalizer.regex.file.inject" to a special regex-normalize-inject.xml
(or any other filename of your choice). In that file you can define any such
rules as described.
Why are there no such specific rules for the injector?
- maybe just because no one wrote them or wants to maintain the rule set
(defining a commonly accepted set of rules isn't easy:
you could go on forever, e.g. what about also adding www. if it's missing?)
- seeds are fully controlled by the crawl administrators, so it's
comparatively simple to teach them to use fully specified URLs.
Much simpler than explaining the usage of URL filters.
Sebastian