Posted to dev@nutch.apache.org by Jack Tang <hi...@gmail.com> on 2006/03/01 03:23:38 UTC

Duplicate Content Issues

Hi

How to avoid duplicate content?
1. Mirror sites: 1 website, 2 domains.
2. Confusing the bot: dynamic URLs. As robots crawl dynamic content,
the site may return different URLs with the same content…
3. Print-friendly pages?

Will Nutch enhance the dedup code?
/Jack
--
Keep Discovering ... ...
http://www.jroller.com/page/jmars

Re: Duplicate Content Issues

Posted by Jérôme Charron <je...@gmail.com>.
> How to avoid duplicate content?

You can use the org.apache.nutch.crawl.TextProfileSignature implementation
instead of the default MD5Signature or provide your own Signature
implementation.
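
The signature class is normally selected through configuration rather than
code. A minimal sketch, assuming your Nutch version reads the
db.signature.class property and that local overrides live in
conf/nutch-site.xml (check nutch-default.xml in your release for the exact
property name and default value):

```xml
<!-- conf/nutch-site.xml: override the page signature implementation. -->
<!-- Assumption: the property is named db.signature.class in your release. -->
<configuration>
  <property>
    <name>db.signature.class</name>
    <value>org.apache.nutch.crawl.TextProfileSignature</value>
    <description>Use a text-profile signature instead of the default
    MD5Signature, so that pages with nearly identical text can be
    treated as duplicates.</description>
  </property>
</configuration>
```

TextProfileSignature builds its hash from a profile of the page's text
rather than from the raw content, so mirrors and print-friendly variants
that differ only in small details are more likely to share a signature.
A custom Signature implementation would be plugged in the same way, by
putting its fully qualified class name in the value element.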

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/