Posted to user@manifoldcf.apache.org by Erlend Garåsen <e....@usit.uio.no> on 2011/07/01 14:09:51 UTC

Duplicate documents and MCF

If I understand ManifoldCF correctly, a unique document is a document 
with a distinct URL such as
http://www.example.org/foo/index.html

Therefore I guess that MCF treats the following document as different 
from the example above:
http://www.example.org/foo/
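
To make the distinction concrete, here is a minimal Python sketch. It is not 
anything MCF or Solr does as far as I know; collapse_default_page and 
DEFAULT_PAGES are hypothetical names, and the sketch assumes the web server 
serves index.html as the default page for a directory:

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical list of default page names; an assumption about the
# web server's configuration, not something MCF knows about.
DEFAULT_PAGES = ("index.html",)


def collapse_default_page(url: str) -> str:
    """Strip a trailing default page name so both URL forms compare equal."""
    parts = urlsplit(url)
    path = parts.path
    for name in DEFAULT_PAGES:
        if path.endswith("/" + name):
            # Drop the file name but keep the trailing slash.
            path = path[: -len(name)]
            break
    return urlunsplit(
        (parts.scheme, parts.netloc, path, parts.query, parts.fragment)
    )


a = "http://www.example.org/foo/index.html"
b = "http://www.example.org/foo/"

# As plain strings the two URLs differ, so they count as two documents.
print(a == b)
# After collapsing the default page name they compare equal.
print(collapse_default_page(a) == collapse_default_page(b))
```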

After a huge crawl, I now have a lot of duplicate documents in my 
Solr index, and I'm not quite sure how to cope with this problem. I 
guess I have several options:
1) Give root URLs a higher score, so that duplicates such as the first 
example above are listed further down in the search results.
2) Filter out index.html documents, but then I have no guarantee 
that the root URL has been indexed (in case links to the document 
only pointed to index.html).
3) Store a hash value generated from each document's content and use 
it as the document's unique id.
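
Option 3 could look roughly like the following Python sketch. It only 
illustrates the idea and is not how MCF or Solr implement anything; 
content_signature and deduplicate are hypothetical names, SHA-1 is an 
arbitrary choice of hash, and the whitespace normalization is an assumption 
about what should count as "the same content":

```python
import hashlib


def content_signature(text: str) -> str:
    """Return a hex digest that identifies a document by its content."""
    # Collapse runs of whitespace so trivial formatting differences
    # do not produce different signatures (an assumption, adjust as needed).
    normalized = " ".join(text.split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def deduplicate(docs):
    """Map each content signature to one URL, preferring the shortest URL."""
    kept = {}
    for url, body in docs:
        sig = content_signature(body)
        if sig not in kept or len(url) < len(kept[sig]):
            kept[sig] = url
    return kept


docs = [
    ("http://www.example.org/foo/index.html", "<html>Hello  world</html>"),
    ("http://www.example.org/foo/", "<html>Hello world</html>"),
]

unique = deduplicate(docs)
for sig, url in unique.items():
    print(sig, url)
```

With the signature as the id, both URLs above collapse into a single entry, 
and preferring the shortest URL keeps the root URL rather than the 
index.html variant.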

Erlend
-- 
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050