Posted to user@nutch.apache.org by Stefan Neufeind <ap...@stefan-neufeind.de> on 2006/05/22 16:16:17 UTC

Applying new regex-normalizer-rules to indexed pages

Hi,

During a long fetch run I ran into URLs containing session IDs, which
caused problems. So I worked out how to write and test proper
regex-normalizer rules (see NUTCH-279).
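
To illustrate the kind of rule meant here, the following is a minimal
sketch in Python of ordered pattern/substitution rules that strip
session IDs. The patterns, the parameter names (jsessionid, PHPSESSID),
the normalize() helper and the example.com URLs are all assumptions made
for this sketch; they are not the rules attached to NUTCH-279, and Nutch
itself takes its rules from its regex-normalizer configuration rather
than from Python.

```python
import re

# Hypothetical rules, applied in order as (pattern, substitution) pairs.
# These are illustrative only and are not the rules from NUTCH-279.
RULES = [
    # Path parameter form: /page.jsp;jsessionid=ABC123
    (re.compile(r";jsessionid=[0-9a-zA-Z]+", re.IGNORECASE), ""),
    # Session parameter followed by further parameters: keep the separator.
    (re.compile(r"([?&])phpsessid=[0-9a-zA-Z]+&", re.IGNORECASE), r"\1"),
    # Session parameter at the end of the URL: drop it with its separator.
    (re.compile(r"[?&]phpsessid=[0-9a-zA-Z]+$", re.IGNORECASE), ""),
]


def normalize(url):
    """Apply every rule in order and return the rewritten URL."""
    for pattern, substitution in RULES:
        url = pattern.sub(substitution, url)
    return url


# Two URLs that differ only in their session ID.
raw_urls = [
    "http://example.com/page.php?id=5&PHPSESSID=abc123",
    "http://example.com/page.php?PHPSESSID=def456&id=5",
]

# After normalization both collapse to the same URL.
unique_urls = {normalize(u) for u in raw_urls}
```

With rules like these, URLs that differ only in their session ID
collapse into a single entry, which is the "duplicate" case asked about
below.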

Now I wonder whether URLs will be properly normalized on the next
fetch round, or whether the un-normalized URLs already in the crawldb
will be picked up by generate and fetched again, without it being
noticed that they are "duplicates" after normalization.

Also, is there a way to "clean" the page index before actually indexing?
Or would this be taken care of automatically (does the normalizer run
again?) when performing the actual invertlinks/index/dedup?


Regards,
 Stefan