Posted to dev@nutch.apache.org by Diaa Abdallah <di...@gmail.com> on 2014/04/25 11:53:24 UTC

Why are web urls not assumed to be http

Hi,
I tried injecting www.google.com into my crawldb without prepending
http:// to it.
It was injected fine; however, when I ran generate on it, I got the
following warning:
"Malformed URL: 'www.google.com', skipping (java.net.MalformedURLException:
no protocol: www.google.com)"

Why doesn't Nutch assume that web links beginning with www. use the
http protocol?

Thanks,
Diaa

Re: Why are web urls not assumed to be http

Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi Diaa,

> Why doesn't Nutch assume that web links beginning with www. use the
> http protocol?

It would not be a big problem to do so. The URL normalizer provides scopes
(inject, fetch, etc.): you only have to point the property
"urlnormalizer.regex.file.inject" to a dedicated regex-normalize-inject.xml
(or any other file name of your choice). In this file you can define
rules like the one described.
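
For example (an untested sketch; it assumes the urlnormalizer-regex
plugin is enabled and the rule file is placed in the conf directory),
set the scoped property in nutch-site.xml:

  <property>
    <name>urlnormalizer.regex.file.inject</name>
    <value>regex-normalize-inject.xml</value>
  </property>

and define a rule in regex-normalize-inject.xml:

  <?xml version="1.0"?>
  <regex-normalize>
    <!-- illustrative rule: prepend http:// to URLs starting with www.
         (URLs that already carry a protocol will not match) -->
    <regex>
      <pattern>^www\.</pattern>
      <substitution>http://www.</substitution>
    </regex>
  </regex-normalize>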

Why are there no such specific rules for the injector?
- maybe just because no one wrote them or wants to maintain the rule set
  (defining a commonly accepted set of rules isn't easy:
   you can always go further, e.g. what about also adding www. when it's
   missing?)
- seeds are fully controlled by the crawl administrators, so it's
  comparatively simple to teach them to use fully specified URLs
  (see the example below). Much simpler than explaining the usage of
  URL filters.
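
For illustration (standard Nutch usage; the directory names here are
just examples):

  $ cat urls/seed.txt
  http://www.google.com/

  $ bin/nutch inject crawldb urls/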

Sebastian

On 04/25/2014 11:53 AM, Diaa Abdallah wrote:
> Hi,
> I tried injecting www.google.com into my crawldb without prepending
> http:// to it.
> It was injected fine; however, when I ran generate on it, I got the
> following warning:
> "Malformed URL: 'www.google.com', skipping (java.net.MalformedURLException:
> no protocol: www.google.com)"
> 
> Why doesn't Nutch assume that web links beginning with www. use the
> http protocol?
> 
> Thanks,
> Diaa
>