Posted to solr-user@lucene.apache.org by vibhoreng04 <vi...@gmail.com> on 2012/01/02 14:02:54 UTC

Re: How to run the solr dedup for the document which match 80% or match almost.

Hi,

I implemented TextProfileSignature dedupe as suggested, but I came across
something weird while implementing it.
I am testing it by indexing two documents.

Please see the content below:

<<<<<<<<<Content starts Here>>>>>>>>>>>
I bought a Toyota Camry in 2007. After driven 60000km, Test02 my engine oil
light starts flash after change engine oil and just drive 5000Km during I
use brake. I went to Toyota to ask a , it is said the normal engine Test03
oil consumption is 0.4 to 0.5L/1000Km. Test04 If so, Toyota recommends
6000Km for each engine oil change. If so, after driving 6000Km,Test05 the
engine oil consumption is 3Litre. But each time, the dealer just put 4 Litre
oil in. That means there is just 1 Litre in engine after driving
6000Km. Test06 Does anybody have standard engine oil consumption? As I
searched, even in some undeveloped countries, it is just 0.3Litre/1000Km.
<<<<<<<<<Content ends Here>>>>>>>>>>>


If I keep adding new test words (test01, test02, test03, and so on) to the
second document, Solr still recognizes the second document as a duplicate.
But if I add any of the test words more than once (for example test11 or
test07), the document count becomes 2 and the dedupe no longer works.

1) Is this the default behavior, or is there something to fix?

2) Can you please also tell me what the threshold limit for dedupe is?

3) QUANT = QUANT_RATE * maxFreq, where QUANT_RATE is 0.01f by default, and
maxFreq is the maximum token frequency. If maxFreq is higher than 1, then
QUANT is always higher than 2.

Can you please clarify the explanation above? I mean, if QUANT_RATE = .01f
and f is less than 100, then how can QUANT be an integer?
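
To make the question concrete, here is a minimal Python sketch of one way to
read the quoted description. It is not Solr's code. The rounding step, the
floor of 2, the ASCII tokenizer, the minimum token length and the sort order
are all assumptions made for illustration, and quant_for and profile are
hypothetical names. One thing that is certain: in Java the trailing "f" in
0.01f is the float-literal suffix, not a variable.

```python
import re
from collections import Counter

QUANT_RATE = 0.01  # default quoted above (0.01f in Java; "f" marks a float literal)


def quant_for(max_freq, quant_rate=QUANT_RATE):
    """Integer quantization step (assumed: round, then enforce a floor)."""
    quant = int(max_freq * quant_rate + 0.5)  # round to the nearest integer
    if quant < 2:
        # Assumed floor: 2 when any token repeats, otherwise 1.
        quant = 2 if max_freq > 1 else 1
    return quant


def profile(text, min_token_len=2):
    """Return the (token, quantized count) pairs that survive quantization."""
    # Assumed tokenizer: lowercase ASCII letters/digits, longer than min_token_len.
    tokens = [t for t in re.findall(r"[a-z0-9]+", text.lower())
              if len(t) > min_token_len]
    counts = Counter(tokens)
    if not counts:
        return []
    quant = quant_for(max(counts.values()))
    kept = {}
    for tok, cnt in counts.items():
        rounded = (cnt // quant) * quant  # round down to a multiple of quant
        if rounded >= quant:              # drop tokens below the quant
            kept[tok] = rounded
    # Sort by count (descending), then token, so the result is deterministic.
    return sorted(kept.items(), key=lambda kv: (-kv[1], kv[0]))


base = "engine oil engine oil toyota"
print(profile(base) == profile(base + " test07"))         # word added once
print(profile(base) == profile(base + " test07 test07"))  # word added twice
```

Under these assumptions, a word that occurs only once never enters the
profile, so adding it leaves the signature unchanged, while a word that
occurs twice reaches the quant of 2 and changes the profile. That would match
the behavior described above, if the real implementation works this way.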


Regards,

Vibhor

