Posted to user@nutch.apache.org by Elwin <ma...@gmail.com> on 2006/03/06 04:16:03 UTC

find duplicate urls in webdb

When I read pages out of a webdb and printed the URL of each page, I
found two URLs that are exactly the same.
Is it possible for two pages to have the same URL?

--
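"The Final Combat" (《盖世豪侠》) won rave reviews and kept TVB's ratings high,
yet TVB, pleased as it was, still gave him no major roles. Stephen Chow was
never one to stay in a small pond; once his comic talent had shown itself,
he would not accept being sidelined, so he moved into film to shine on the
big screen. TVB had gained a thousand-li horse and then lost it, and
naturally regretted it too late.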

Re: find duplicate urls in webdb

Posted by Andrzej Bialecki <ab...@getopt.org>.
Elwin wrote:
> When I read pages out of a webdb and printed the URL of each page, I
> found two URLs that are exactly the same.
> Is it possible for two pages to have the same URL?
>

WebDB should not allow two URLs that are exactly the same (Nutch uses an
MD5 signature for that). Please check them carefully; most probably they
differ in only a single character or in whitespace.
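
One way to do that check is to compare the two printed strings character by
character instead of by eye. The following is only a rough sketch in Python,
not Nutch code: the helper names (md5_hex, first_difference, describe) and
the example URLs are made up for illustration, and hashing the UTF-8 bytes of
the URL is an assumption, since exactly what Nutch feeds into its MD5
signature is not stated here.

```python
import hashlib
import unicodedata


def md5_hex(url: str) -> str:
    """MD5 of the URL's UTF-8 bytes (an assumption made for illustration)."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()


def first_difference(a: str, b: str):
    """Return (index, char_in_a, char_in_b) for the first mismatch, or None."""
    for i, (ca, cb) in enumerate(zip(a, b)):
        if ca != cb:
            return i, ca, cb
    if len(a) != len(b):
        # One string is a prefix of the other; the extra character differs.
        i = min(len(a), len(b))
        return i, a[i:i + 1], b[i:i + 1]
    return None


def describe(ch: str) -> str:
    """Readable name for a character, so invisible ones become visible."""
    if ch == "":
        return "<end of string>"
    return f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNNAMED')}"


if __name__ == "__main__":
    # Hypothetical URLs: the second one carries a trailing space.
    url_a = "http://example.com/page"
    url_b = "http://example.com/page "
    print(md5_hex(url_a), md5_hex(url_b))
    diff = first_difference(url_a, url_b)
    if diff is None:
        print("identical")
    else:
        i, ca, cb = diff
        print(f"differ at index {i}: {describe(ca)} vs {describe(cb)}")
```

If the two strings really are identical, first_difference returns None and
the two digests match; otherwise the reported index and character name show
where the URLs diverge, even when the difference is whitespace.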

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com