Posted to user@nutch.apache.org by Emmanuel <jo...@gmail.com> on 2007/07/30 14:28:05 UTC

MergeSegs

I'm wondering why there is no option to normalize URLs when merging
segments.
It should work the same way as it does for mergedb and mergelinkdb.

For instance, let's say I have crawled two URLs:
http://auto.yahoo.com/index.php?auto=BMW&sort=desc
http://auto.yahoo.com/index.php?auto=BMW
The page content is the same; only the display order differs because of the
sort parameter, so I don't need to index the page twice.
I would then normalize the URLs to remove the extra parameter (sort=) and
thus reduce duplicate content, i.e.
http://auto.yahoo.com/index.php?auto=BMW&sort=desc will become
http://auto.yahoo.com/index.php?auto=BMW
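
To make the intended rewrite concrete, here is a minimal sketch in Python. It
is not Nutch code and does not use Nutch's normalizer plugins; the function
name normalize_url and the IGNORED_PARAMS set are hypothetical names chosen
for this illustration, and the assumption is simply that "sort" only affects
presentation.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical list of query parameters that change only the display,
# not the content, of a page (an assumption for this example).
IGNORED_PARAMS = {"sort"}


def normalize_url(url, ignored=IGNORED_PARAMS):
    """Return the URL with the ignored query parameters removed."""
    parts = urlsplit(url)
    # Keep every query pair whose name is not in the ignored set,
    # preserving the original order of the remaining parameters.
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name not in ignored
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )


if __name__ == "__main__":
    print(normalize_url("http://auto.yahoo.com/index.php?auto=BMW&sort=desc"))
```

With this rule both example URLs map to the same normalized URL, so only one
of them needs to be kept.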

The URL that was normalized away is removed when I merge my crawldb and my
linkdb, so the same should happen for the segments.
I don't see the point of keeping crawl_generate, parse_data, etc. entries
for a URL that no longer exists in the crawldb.

Maybe I am missing something; if so, please help me understand.