Posted to dev@nutch.apache.org by Jack Tang <hi...@gmail.com> on 2006/03/01 03:23:38 UTC
Duplicate Content Issues
Hi
How to avoid duplicate content?
1. Mirror sites: 1 website, 2 domains.
2. Confusing the bot: dynamic URLs. As robots crawl dynamic content,
the site may return a different URL for the same content…
3. Print-friendly pages?
Will Nutch enhance the dedup code?
/Jack
--
Keep Discovering ... ...
http://www.jroller.com/page/jmars
Re: Duplicate Content Issues
Posted by Jérôme Charron <je...@gmail.com>.
> How to avoid duplicate content?
You can use the org.apache.nutch.crawl.TextProfileSignature implementation
instead of the default MD5Signature or provide your own Signature
implementation.
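
To illustrate why a text-profile signature can catch near-duplicates that an exact hash misses, here is a minimal standalone sketch. It is not the Nutch code: the class SignatureSketch, its method names, and the minTokenLen and quant parameters are hypothetical, and the real TextProfileSignature may differ in its details. The idea shown is to hash a quantized word-frequency profile instead of the raw bytes, so rare words (a "print version" banner, for example) do not change the signature.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; not part of Nutch.
class SignatureSketch {

    // Hex-encoded MD5 digest of a string (UTF-8 bytes).
    static String md5Hex(String s) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(s.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder();
            for (byte b : digest) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Exact signature: any difference in the content changes the result.
    static String exactSignature(String content) {
        return md5Hex(content);
    }

    // Fuzzy signature: hash a profile of quantized token frequencies.
    // Assumes minTokenLen >= 1 and quant >= 1 (both are made-up knobs).
    static String profileSignature(String text, int minTokenLen, int quant) {
        // TreeMap keeps tokens sorted, so the profile is order-independent.
        Map<String, Integer> counts = new TreeMap<>();
        String[] tokens = text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+");
        for (String token : tokens) {
            if (token.length() >= minTokenLen) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        StringBuilder profile = new StringBuilder();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            // Round each count down to a multiple of quant; drop tokens that reach 0.
            int quantized = (e.getValue() / quant) * quant;
            if (quantized > 0) {
                profile.append(e.getKey()).append(' ').append(quantized).append('\n');
            }
        }
        return md5Hex(profile.toString());
    }

    public static void main(String[] args) {
        String page = "Nutch crawls the web. Nutch indexes the web.";
        String printable = "Print version: " + page;
        System.out.println("exact match:   "
            + exactSignature(page).equals(exactSignature(printable)));
        System.out.println("profile match: "
            + profileSignature(page, 2, 2).equals(profileSignature(printable, 2, 2)));
    }
}
```

In this sketch the two pages get different exact signatures but the same profile signature, because the extra words occur only once and fall below the quantization step. In Nutch itself, the signature class is normally chosen through the db.signature.class configuration property; check the defaults shipped with your version.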
Jérôme
--
http://motrech.free.fr/
http://www.frutch.org/