Posted to user@nutch.apache.org by vishal vachhani <vi...@gmail.com> on 2008/09/21 18:54:17 UTC

Duplicate pages in result of queries

Hi,

Is this a bug, or am I missing something?

I have crawled many URLs using Nutch 0.9. When I query the index created
by the crawl, some of the results are duplicates.

How does Nutch decide that two URLs are duplicates? Is it based on URL string
matching or on the content of the pages?

For example, the content of the following pages is the same, but the URLs
differ because of "/", "//" and "///" after the host name:

http://www.indianholiday.com/india-wildlife-holidays/index.html
                            ^
http://www.indianholiday.com//india-wildlife-holidays/index.html
                            ^^
http://www.indianholiday.com///india-wildlife-holidays/index.html
                            ^^^
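
To make clear what kind of normalization I have in mind, here is a minimal
standalone Java sketch. It is not Nutch code and does not use any Nutch API;
the class name SlashCollapser and the method collapseSlashes are hypothetical
names chosen for illustration. It assumes that repeated slashes in the path
never change which page is served, which holds for the URLs above but is not
guaranteed for every site.

```java
class SlashCollapser {

    // Collapse runs of '/' in the host-and-path part of a URL into a single '/'.
    // The "://" after the scheme and anything from the first '?' or '#' onward
    // are left untouched.
    static String collapseSlashes(String url) {
        int schemeEnd = url.indexOf("://");
        int start = (schemeEnd < 0) ? 0 : schemeEnd + 3;

        // Find where the path ends: the first '?' or '#' after the scheme.
        int end = url.length();
        for (int i = start; i < url.length(); i++) {
            char c = url.charAt(i);
            if (c == '?' || c == '#') {
                end = i;
                break;
            }
        }

        String head = url.substring(0, start);
        String middle = url.substring(start, end).replaceAll("/{2,}", "/");
        String tail = url.substring(end);
        return head + middle + tail;
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://www.indianholiday.com/india-wildlife-holidays/index.html",
            "http://www.indianholiday.com//india-wildlife-holidays/index.html",
            "http://www.indianholiday.com///india-wildlife-holidays/index.html"
        };
        // All three inputs should print the same single-slash URL.
        for (String u : urls) {
            System.out.println(collapseSlashes(u));
        }
    }
}
```

If the crawler applied something like this before storing a URL, the three
URLs above would map to one entry.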

Any idea how to remove this kind of duplicate page from the crawl?

Thanks in advance!!

-- 
Thanks and Regards,
Vishal Vachhani