Posted to solr-user@lucene.apache.org by Pranav Prakash <pr...@gmail.com> on 2011/08/30 08:57:01 UTC

Solr 3.3. Grouping vs DeDuplication and Deduplication Use Case

Solr 3.3 has a feature called "Grouping". Is it practically the same as deduplication?

Here is my use case for duplicate removal:

We have many documents with very similar (up to 99%) content. For some search
queries, almost all of them show up on the first page of results. Of all these
documents, essentially one is the original and the others are duplicates. We are
able to identify the original on the basis of a number of factors: who
uploaded it, when, and how many viral shares it has. It is also possible that the
duplicates were uploaded earlier (and hence already exist in the search index)
while the original is uploaded later (and gets added to the index later).

AFAIK, deduplication works at index time. Is there a way I can specify the
original, which should be returned, and the duplicates, which should be kept
from coming up?


*Pranav Prakash*

"temet nosce"

Twitter <http://twitter.com/pranavprakash> | Blog <http://blog.myblive.com> |
Google <http://www.google.com/profiles/pranny>

Re: Solr 3.3. Grouping vs DeDuplication and Deduplication Use Case

Posted by Marc Sturlese <ma...@gmail.com>.
Deduplication uses Lucene's IndexWriter.updateDocument with the signature
term. I don't think it's possible, as a default feature, to choose which
document stays in the index: for the "original" to be the one kept, it always
has to be the last to be indexed.

From the IndexWriter.updateDocument documentation:
"Updates a document by first deleting the document(s) containing term and
then adding the new document. The delete and then add are atomic as seen by
a reader on the same index (flush may happen only after the add)."
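
To make the "last one indexed wins" behaviour concrete, here is a toy
in-memory model of the delete-then-add semantics described above. It is only
a sketch: ToyIndex, update_document, the "signature" field name and the
"abc123" value are all made up for illustration, and none of this is the
Lucene API (atomicity and flushing are not modelled at all).

```python
from typing import Dict, List


class ToyIndex:
    """Toy in-memory stand-in for an index; NOT the Lucene API."""

    def __init__(self) -> None:
        self.docs: List[Dict[str, str]] = []

    def update_document(self, field: str, value: str, doc: Dict[str, str]) -> None:
        # Step 1: delete every document containing the term (field, value).
        self.docs = [d for d in self.docs if d.get(field) != value]
        # Step 2: add the new document.
        self.docs.append(doc)


index = ToyIndex()

# Two near-duplicates that share the same (hypothetical) signature value.
# The original happens to be indexed first, the copy second.
index.update_document("signature", "abc123", {"id": "original", "signature": "abc123"})
index.update_document("signature", "abc123", {"id": "copy", "signature": "abc123"})

# Only the most recently indexed document is left in the index.
surviving_ids = [d["id"] for d in index.docs]
```

In this model the later "copy" replaces the "original", so index order alone
decides which document is kept.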

With grouping, all your documents stay indexed, so it gives you more
flexibility.
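
As a rough sketch of what that could look like at query time, the snippet
below builds a grouped query URL that collapses documents sharing a signature
and returns one document per group. The parameters group, group.field,
group.limit and group.sort are Solr result grouping parameters; everything
else is an assumption for illustration: the host and path, a "signature"
field filled in at index time (for example by the deduplication update
processor with overwriteDupes set to false), and a "viral_shares" field used
to put the preferred document first in each group.

```python
from urllib.parse import urlencode


def build_grouped_query(user_query: str,
                        signature_field: str = "signature",
                        group_sort: str = "viral_shares desc") -> str:
    """Build a Solr select URL that returns one document per signature group.

    The field names and the base URL are hypothetical; adjust to your schema.
    """
    params = {
        "q": user_query,
        "group": "true",                 # turn on result grouping
        "group.field": signature_field,  # documents with the same signature form one group
        "group.limit": 1,                # return only the top document of each group
        "group.sort": group_sort,        # ordering inside a group picks the "original"
    }
    # Assumed local Solr instance; not taken from the original thread.
    return "http://localhost:8983/solr/select?" + urlencode(params)


url = build_grouped_query("solar panels")
```

Because the duplicates stay in the index, the choice of which document
represents a group is made per query, so it does not depend on the order in
which the documents were indexed.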
