Posted to user@nutch.apache.org by Feng Ji <fe...@gmail.com> on 2006/09/01 23:38:58 UTC

Same URLs differing only by a trailing slash (Nutch 0.8)

hi,

I found a case where two effectively identical URLs are both included in the
webdb. The only difference is the presence or absence of a trailing slash.

For example, http://abc.com/ and http://abc.com both appear in the dumped
webdb (one comes from the seeds file and the other from the outlinks of other
URLs). Will that cause problems, such as the same page being shown twice at
search time?
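
To make the question concrete, here is a minimal Python sketch of the kind of
normalization I have in mind. It is not Nutch code: normalize_url is a
hypothetical helper, and it only assumes that a URL with an empty path should
be treated the same as the one with "/" as its path.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Hypothetical helper: rewrite an empty path as "/" so both forms match."""
    parts = urlsplit(url)
    # "http://abc.com" parses with an empty path; treat it as the root path.
    path = parts.path or "/"
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


seed = "http://abc.com"      # form found in the seeds file
outlink = "http://abc.com/"  # form found as an outlink of another page

# With this normalization, both forms collapse to a single key.
print(normalize_url(seed) == normalize_url(outlink))
```

Only the empty path is rewritten here; a URL such as http://abc.com/a is left
alone, since adding a slash to a non-empty path may point to a different
resource.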

thanks,

Michael,