Posted to user@nutch.apache.org by vishal vachhani <vi...@gmail.com> on 2008/09/21 18:54:17 UTC
Duplicate pages in result of queries
Hi,
Is this a bug, or am I missing something?
I have crawled many URLs using Nutch-0.9. When I query the index created
from the crawl, some of the results are duplicates.
How does Nutch decide that URLs are duplicates? Is it based on URL string
matching or on the content of the pages?
For example, the content of these pages is the same, but the URLs differ
because of "/", "//" and "///":
http://www.indianholiday.com/india-wildlife-holidays/index.html
                           ^^^
http://www.indianholiday.com//india-wildlife-holidays/index.html
                           ^^^^
http://www.indianholiday.com///india-wildlife-holidays/index.html
                           ^^^^
Any idea how to remove this kind of duplicate page from the crawl?
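To illustrate the kind of normalization that would treat the three URLs above as one, here is a minimal sketch in plain Java. It is only an assumption about what a fix could look like, not Nutch's own code or API: the class and method names (SlashNormalizer, collapseSlashes) are hypothetical, and the regular expression is deliberately simple (it would also collapse repeated slashes inside a query string).

```java
import java.util.regex.Pattern;

// Hypothetical helper, NOT part of Nutch: collapses runs of '/' in a URL
// while leaving the "//" that follows the scheme (e.g. "http:") untouched.
final class SlashNormalizer {

    // Two or more consecutive slashes that are not directly preceded by ':'.
    private static final Pattern EXTRA_SLASHES = Pattern.compile("(?<!:)/{2,}");

    // Returns the URL with every such run replaced by a single '/'.
    static String collapseSlashes(String url) {
        return EXTRA_SLASHES.matcher(url).replaceAll("/");
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://www.indianholiday.com/india-wildlife-holidays/index.html",
            "http://www.indianholiday.com//india-wildlife-holidays/index.html",
            "http://www.indianholiday.com///india-wildlife-holidays/index.html"
        };
        // All three inputs print the same single-slash form.
        for (String url : urls) {
            System.out.println(collapseSlashes(url));
        }
    }
}
```

If the URLs were rewritten like this before being compared, the three variants would become identical strings, so a URL-based duplicate check would see only one page.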
Thanks in advance!!
--
Thanks and Regards,
Vishal Vachhani