Posted to solr-user@lucene.apache.org by Vadim Kisselmann <v....@googlemail.com> on 2011/11/07 13:29:13 UTC

Similar documents and advantages / disadvantages of MLT / Deduplication

Hello folks,

I have some questions about MLT and Deduplication, and which of them would
be the better choice in my case.

Case:

I index 1000 docs, 5 of which are 95% the same (for example: copy-pasted
blog articles from different sources, with slight changes such as the
author name).
They are not identical, though.
*Now I would like to see 1 doc in my result set, with the other 4 marked
as similar.*

With *MLT*:
<str name="mlt.fl">text</str>
<int name="mlt.minwl">5</int>
<int name="mlt.maxwl">50</int>
<int name="mlt.maxqt">3</int>
<int name="mlt.maxntp">5000</int>
<bool name="mlt.boost">true</bool>
<str name="mlt.qf">text</str>
</lst>

With this config I get about 500 similar docs for this 1 doc, which is
unfortunately far too many.


*Deduplication*:
I now index these docs with a signature, using TextProfileSignature.

<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">signature_t</str>
    <bool name="overwriteDupes">false</bool>
    <str name="fields">text</str>
    <str name="signatureClass">solr.processor.TextProfileSignature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

How can I compare the created signatures?


I only want to see the 5 similar docs, nothing else.
Which of these two approaches is relevant for me? Can I tune MLT for my
requirement? Or should I use Dedupe?

Thanks and Regards
Vadim

Re: Similar documents and advantages / disadvantages of MLT / Deduplication

Posted by Chris Hostetter <ho...@fucit.org>.
: I index 1000 docs, 5 of them are 95% the same (for example: copy pasted
: blog articles from different sources, with slight changes (author name,
: etc..)).
: But they have differences.
: *Now i like to see 1 doc in my result set and the other 4 should be marked
: as similar.*

Do you actually want all 1000 docs in your index, or do you want to prevent
4 of the 5 copies of the doc from being indexed?

Either way, if the TextProfileSignature is doing a good job of
identifying the 5 similar docs, then use that at index time.

If you want to keep 4/5 out of the index, then use the Deduplication
features to prevent the duplicates from being indexed and you're done.
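
As a rough sketch of that first option, the chain from the original mail
could be switched to overwrite duplicates and attached to the update
handler. The handler class and the name of the chain parameter below are
assumptions: the parameter is `update.chain` in recent releases and
`update.processor` in older ones, so check against your version.

```xml
<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">signature_t</str>
    <!-- true: a new doc replaces existing docs with the same signature -->
    <bool name="overwriteDupes">true</bool>
    <str name="fields">text</str>
    <str name="signatureClass">solr.processor.TextProfileSignature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

<!-- assumed handler; route updates through the chain above -->
<requestHandler name="/update" class="solr.XmlUpdateRequestHandler">
  <lst name="defaults">
    <str name="update.chain">dedupe</str>
  </lst>
</requestHandler>
```

With this setup the most recently indexed copy is the one that survives,
and the signature field needs to be indexed so that earlier copies can be
found and removed.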

If you want all docs in the index, then you have to decide how you want to
"mark" docs as similar ... do you want only one of those docs to appear
in your results, or do you want all of them in the results but with an
indication that there are other similar docs?  If the former: take a look
at "Grouping" and group on your signature field.  If the latter: use the
MLT component to find similar docs based on the signature field
(i.e. mlt.fl=signature_t).
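
A minimal sketch of the MLT variant, written as handler defaults in the
same style as the config in the original mail. The handler name
`/similar` is hypothetical. The `mlt.mintf` and `mlt.mindf` overrides are
there because a signature occurs only once per doc, while MLT by default
ignores terms that occur fewer than 2 times in the source doc or in fewer
than 5 docs overall.

```xml
<!-- hypothetical handler name; the params can also be sent per request -->
<requestHandler name="/similar" class="solr.SearchHandler">
  <lst name="defaults">
    <bool name="mlt">true</bool>
    <!-- compare docs on the signature instead of the full text -->
    <str name="mlt.fl">signature_t</str>
    <!-- the signature appears once per doc, so lower the term frequency floor -->
    <int name="mlt.mintf">1</int>
    <!-- allow signatures shared by only a couple of docs -->
    <int name="mlt.mindf">1</int>
    <!-- at most 10 similar docs per hit -->
    <int name="mlt.count">10</int>
  </lst>
</requestHandler>
```

Each hit then comes back with its own list of docs that share the
signature, instead of the roughly 500 loosely related docs that matching
on the text field produced.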

https://wiki.apache.org/solr/FieldCollapsing
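
For the grouping variant, a sketch along the same lines (the handler name
`/grouped` is hypothetical, and this assumes a Solr version that ships
with grouping):

```xml
<!-- hypothetical handler name; the params can also be sent per request -->
<requestHandler name="/grouped" class="solr.SearchHandler">
  <lst name="defaults">
    <bool name="group">true</bool>
    <!-- one group per distinct signature -->
    <str name="group.field">signature_t</str>
    <!-- show only the top doc of each group -->
    <int name="group.limit">1</int>
    <!-- also report the total number of groups -->
    <bool name="group.ngroups">true</bool>
  </lst>
</requestHandler>
```

Each group in the response carries its own numFound, so a group with
numFound=5 is 1 visible doc plus 4 similar ones. Grouping generally
expects a single-valued, untokenized field, so depending on how `*_t` is
defined in your schema, a string field may be a safer home for the
signature.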

-Hoss