Posted to user@manifoldcf.apache.org by Erlend Garåsen <e....@usit.uio.no> on 2011/07/01 14:09:51 UTC
Duplicate documents and MCF
If I understand ManifoldCF correctly, a unique document is a document
with a distinct URL such as
http://www.example.org/foo/index.html
Therefore I guess that MCF treats the following URL as a different
document from the example above:
http://www.example.org/foo/
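
To illustrate, here is a minimal Python sketch of how the two URLs could
be folded into one key before indexing. It is not part of ManifoldCF;
the function name normalize_url and the DEFAULT_PAGES list are
hypothetical, and it assumes the server returns the same content for
both forms.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumption: these file names are served as the directory's default page.
DEFAULT_PAGES = ("index.html", "index.htm")


def normalize_url(url: str) -> str:
    """Fold a trailing default page name into its directory URL."""
    parts = urlsplit(url)
    # Split the path into everything before the last "/" and the last segment.
    head, _, last = parts.path.rpartition("/")
    path = head + "/" if last in DEFAULT_PAGES else parts.path
    return urlunsplit(
        (parts.scheme, parts.netloc, path, parts.query, parts.fragment)
    )


a = normalize_url("http://www.example.org/foo/index.html")
b = normalize_url("http://www.example.org/foo/")
print(a == b)
```

As plain strings the two URLs differ, so anything keyed on the raw URL
sees two documents; after normalization they compare equal.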
After I did a huge crawl, I now have a lot of duplicate documents in my
Solr index, and I'm not quite sure how to cope with this problem. I
guess I have several options:
1) Give root URLs a higher score. Then duplicates such as the first
example above will be listed further down in the search result list.
2) Filter out index.html documents, but then I do not have any guarantee
that the root URL has been indexed (in case links to the documents were
only pointing to index.html).
3) Store a hashed value generated out of the documents' content in order
to give them a unique id.
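
Option 3 could look roughly like the following Python sketch. It is only
an illustration under my own assumptions: the function name content_id
is hypothetical, SHA-256 is an arbitrary choice of hash, and collapsing
whitespace is just one possible way to make trivially different copies
hash alike. It only catches exact duplicates after that normalization,
not near-duplicates.

```python
import hashlib


def content_id(text: str) -> str:
    """Return a hex digest of the document text, usable as a unique id."""
    # Collapse runs of whitespace so formatting differences do not matter.
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


# Two fetches of the same page body (hypothetical content).
page_a = "Welcome to foo\n\nSome   text."
page_b = "Welcome to foo Some text."

print(content_id(page_a) == content_id(page_b))
```

If the digest is used as the document id, a second copy of the same
content overwrites the first in the index instead of being added next to
it.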
Erlend
--
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050