Posted to commits@nutch.apache.org by Apache Wiki <wi...@apache.org> on 2006/03/07 17:51:17 UTC

[Nutch Wiki] Trivial Update of "org.apache.nutch.net.BasicUrlNormalizer" by JeffRitchie

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The following page has been changed by JeffRitchie:
http://wiki.apache.org/nutch/org%2eapache%2enutch%2enet%2eBasicUrlNormalizer

The comment on the change is:
adding page

New page:
= BasicUrlNormalizer Notes =

The Basic URL Normalizer class normalizes a URL in several ways:

 1. Trims leading and trailing white space from the URL (java.lang.String.trim()).
 1. May lowercase the protocol (done by java.net.URL).
 1. If the protocol is http or ftp:
  a. Lowercases the host.
  a. Removes the port if it is the protocol's default.
  a. Adds a trailing slash if no file is specified.
  a. Removes any reference text (the fragment after "#").
  a. Resolves any relative path segments (such as "../").

For example:[[BR]]
 {{{http://wiKI.apache.ORG:80/somedirectory/../DevelopmentCommandLineOptions}}}[[BR]]
would be rewritten as:[[BR]]
 {{{http://wiki.apache.org/DevelopmentCommandLineOptions}}}[[BR]]
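
The steps listed above can be approximated with the standard library alone. The following is a minimal sketch, not the actual BasicUrlNormalizer source: the class name NormalizeSketch and the method normalize are hypothetical, and it uses java.net.URI in place of java.net.URL so that relative path segments can be collapsed with URI.normalize(). The default ports (80 for http, 21 for ftp) are hard-coded as an assumption.

```java
import java.net.URI;
import java.util.Locale;

class NormalizeSketch {

    // Hypothetical helper that mirrors the listed steps; not the Nutch implementation.
    // URI.create throws IllegalArgumentException if the input is not a valid URI.
    static String normalize(String urlString) {
        // String.trim() strips leading and trailing white space.
        String trimmed = urlString.trim();

        // URI.normalize() collapses "." and ".." path segments.
        URI uri = URI.create(trimmed).normalize();
        String scheme = uri.getScheme();
        if (scheme == null) {
            return trimmed; // no protocol: nothing more to do in this sketch
        }

        // Lowercase the protocol.
        scheme = scheme.toLowerCase(Locale.ROOT);
        String host = uri.getHost();
        boolean handled = scheme.equals("http") || scheme.equals("ftp");
        if (!handled || host == null) {
            // Other protocols: only the case of the protocol changes.
            return scheme + trimmed.substring(scheme.length());
        }

        StringBuilder out = new StringBuilder(scheme).append("://");
        if (uri.getRawUserInfo() != null) {
            out.append(uri.getRawUserInfo()).append('@');
        }
        // Lowercase the host.
        out.append(host.toLowerCase(Locale.ROOT));

        // Keep the port only if it is set and is not the assumed default.
        int defaultPort = scheme.equals("http") ? 80 : 21;
        int port = uri.getPort();
        if (port != -1 && port != defaultPort) {
            out.append(':').append(port);
        }

        // Add a trailing slash when no path is present.
        String path = uri.getRawPath();
        out.append(path == null || path.isEmpty() ? "/" : path);

        if (uri.getRawQuery() != null) {
            out.append('?').append(uri.getRawQuery());
        }
        // The fragment (reference text) is deliberately left out.
        return out.toString();
    }

    public static void main(String[] args) {
        // prints http://wiki.apache.org/DevelopmentCommandLineOptions
        System.out.println(normalize(
            "http://wiKI.apache.ORG:80/somedirectory/../DevelopmentCommandLineOptions"));
    }
}
```

In this sketch a URL with any other protocol, for example https, comes back with only its protocol lowercased, which matches the behaviour described in the Notes.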

== Notes ==
Other than trimming white space and the normalization performed by java.net.URL, protocols other than http and ftp are not normalized further.