Posted to user@nutch.apache.org by Iain Lopata <il...@hotmail.com> on 2014/03/27 03:19:15 UTC

URL Normalization and the # sign

I need to inject a number of seed URLs that require a # sign in the
parameters.

Both the basic-normalizer and the regex-normalizer (with its default
configuration) appear to remove the portion of the URL after the #.

I can modify the regex normalizer configuration file to eliminate this
behavior, but I cannot find a way to do this for the basic normalizer
without modifying the source.
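
For reference, the effect of such a regex rule can be tried in isolation. This is only a sketch: the pattern below is an assumption about what the default fragment-stripping rule in regex-normalize.xml looks like, not a copy of the shipped file, so compare it with the rule in your own configuration. The class name, the stripFragment switch, and the example.com URL are invented for illustration.

```java
import java.util.regex.Pattern;

public class RegexFragmentDemo {
    // Assumed rule: drop '#' and everything after it, up to '?', '&' or end of string.
    static final Pattern STRIP_FRAGMENT = Pattern.compile("#.*?(\\?|&|$)");

    // stripFragment == false models a config file with the rule removed.
    static String normalize(String url, boolean stripFragment) {
        return stripFragment
                ? STRIP_FRAGMENT.matcher(url).replaceAll("$1")
                : url;
    }

    public static void main(String[] args) {
        String seed = "http://example.com/app#/items/42"; // hypothetical seed URL
        System.out.println(normalize(seed, true));   // http://example.com/app
        System.out.println(normalize(seed, false));  // http://example.com/app#/items/42
    }
}
```

With the rule left out, the regex step passes the fragment through untouched, so any remaining loss has to come from another normalizer in the chain.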

I have tried encoding the # sign as %23, but this does not work with the site
I am crawling; it appears to need the literal #.

Any other suggestions? If I don't invoke the basic normalizer, does anyone
have the additional entries for the regex normalizer config file that would
replace its functionality?

Thanks

RE: URL Normalization and the # sign

Posted by Iain Lopata <il...@hotmail.com>.

Further analysis suggests that the Injector invokes the normalizer chain and
that there is no command line option to prevent this.

The basic normalizer drops the portion of the URL after the # when it
executes url.getFile() [line 100 in BasicURLNormalizer.java, version 1.6],
which does not return the fragment portion of the URL.
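
The getFile() behavior can be reproduced outside Nutch with plain java.net.URL. The following is a minimal sketch, not the Nutch code itself; the class name FragmentDemo, its two helper methods, and the example.com seed URL are made up for illustration.

```java
import java.net.MalformedURLException;
import java.net.URL;

public class FragmentDemo {
    // Parses the URL, turning the checked exception into an unchecked one for brevity.
    static URL parse(String spec) {
        try {
            return new URL(spec);
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Bad URL: " + spec, e);
        }
    }

    // getFile() returns the path plus the query string; the fragment is not included.
    static String fileOf(String spec) {
        return parse(spec).getFile();
    }

    // getRef() returns the fragment (the part after '#'), or null if there is none.
    static String fragmentOf(String spec) {
        return parse(spec).getRef();
    }

    public static void main(String[] args) {
        String seed = "http://example.com/search?q=nutch#page=2"; // hypothetical seed URL
        System.out.println(fileOf(seed));      // /search?q=nutch
        System.out.println(fragmentOf(seed));  // page=2
    }
}
```

Any code that rebuilds a URL from the protocol, host, port and getFile() therefore loses the fragment unless it appends getRef() again.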

Does this seem correct?

Is there any other way of encoding the seed URL to prevent the fragment from
being dropped?  

Is it possible to omit the basic normalizer from the chain and implement the
same rules in the regex normalizer?
