Posted to user@nutch.apache.org by Elwin <ma...@gmail.com> on 2006/03/06 04:16:03 UTC
find duplicate urls in webdb
When I read pages out of a webdb and printed the URL of each page, I
found two URLs that are exactly the same.
Is it possible for two pages to have the same URL?
--
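"The Final Combat" (《盖世豪侠》) won rave reviews and kept TVB's ratings high,
yet TVB, pleased as it was, still gave him no major roles. Stephen Chow was
no ordinary talent; with his comedic gift now on display, he would not accept
being sidelined, so he moved to the film industry to shine on the big screen.
TVB had gained a prize steed and then lost it, and of course regretted it too late.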
Re: find duplicate urls in webdb
Posted by Andrzej Bialecki <ab...@getopt.org>.
Elwin wrote:
> When I read pages out of a webdb and printed the URL of each page, I
> found two URLs that are exactly the same.
> Is it possible for two pages to have the same URL?
>
WebDB should not allow two URLs that are exactly the same (Nutch uses
an MD5 signature for that). Please check them carefully; most probably they
differ by only a single character or some whitespace.
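
A minimal sketch of such a check, written in Python purely for illustration. It is not Nutch code: the helper names `describe` and `first_difference` are hypothetical, and hashing the UTF-8 bytes of the URL string is an assumption made here to make differences visible, not a statement of how WebDB computes its signature. The idea is to print each URL in an escaped form and to locate the first position where the two strings diverge.

```python
import hashlib


def describe(url: str) -> dict:
    """Expose details that are invisible when a URL is printed normally."""
    return {
        # repr() shows trailing spaces, tabs and newlines as escapes
        "repr": repr(url),
        "length": len(url),
        # Assumption: MD5 over the UTF-8 bytes, for comparison only
        "md5": hashlib.md5(url.encode("utf-8")).hexdigest(),
    }


def first_difference(a: str, b: str):
    """Return the index of the first differing character, or None if equal."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    if len(a) != len(b):
        # One string is a prefix of the other, e.g. trailing whitespace
        return min(len(a), len(b))
    return None


if __name__ == "__main__":
    # Hypothetical pair that looks identical when printed
    u1 = "http://example.com/"
    u2 = "http://example.com/ "
    print(describe(u1))
    print(describe(u2))
    print("first difference at index:", first_difference(u1, u2))
```

If the two digests differ, the strings are not byte-for-byte identical, and the reported index points at the character to inspect.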
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com