Posted to solr-commits@lucene.apache.org by Apache Wiki <wi...@apache.org> on 2012/10/30 08:16:53 UTC
[Solr Wiki] Update of "TextProfileSignature" by JoelNothman
The "TextProfileSignature" page has been changed by JoelNothman:
http://wiki.apache.org/solr/TextProfileSignature
Comment:
description of algorithm
New page:
TextProfileSignature calculates a fuzzy hash of textual fields for [[Deduplication]]. It is configured in a SignatureUpdateProcessorFactory definition, which may include the following parameters:
|| Name || Type || Description || Default value ||
|| `minTokenLen` || int || The minimum token length to consider || 2 ||
|| `quantRate` || float || Multiplied by the maximum token frequency and rounded to give the quantization step `quant` (see Count quantization) || 0.01 ||
The signature calculation proceeds as follows:
=== Tokenization and normalization ===
* Tokens are contiguous alphanumeric characters
* Normalized to lowercase
* Discarded if shorter than `minTokenLen`
Tokens are then counted, tracking the frequency `maxFreq` of the most frequent token.
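The tokenization and counting step can be sketched in Python as follows. This is an illustration of the description above, not Solr's Java code: the function name `count_tokens` and its parameter `min_token_len` are hypothetical, and Python's `str.isalnum` and `str.lower` are assumed to be acceptable stand-ins for the character tests used by Solr.

```python
from collections import Counter


def count_tokens(text, min_token_len=2):
    """Return (counts, max_freq) for lowercase alphanumeric tokens in text."""
    counts = Counter()
    current = []
    # A trailing space acts as a sentinel so the final token is flushed.
    for ch in text + " ":
        if ch.isalnum():
            current.append(ch.lower())
        elif current:
            token = "".join(current)
            current = []
            # Discard tokens shorter than min_token_len, as described above.
            if len(token) >= min_token_len:
                counts[token] += 1
    # max_freq is 0 when no token survives the length filter.
    max_freq = max(counts.values(), default=0)
    return counts, max_freq
```

For example, `count_tokens("The cat, the hat; a THE!")` counts `the` three times and `cat` and `hat` once each, and drops the one-character token `a`.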
=== Count quantization ===
A value `quant` is calculated as follows (the first matching condition applies):
|| || 1 || if `maxFreq` <= 1 ||
||`quant` := || 2 || if round(`maxFreq * quantRate`) < 2 ||
|| || round(`maxFreq * quantRate`) || otherwise ||
Token frequencies are then rounded down to the nearest multiple of `quant`, and any token occurring fewer than `quant` times is discarded.
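A minimal Python sketch of the quantization rule above is given here. The function name `quantize` is hypothetical, and rounding is assumed to be round-half-up (written as `int(x + 0.5)`, which is valid for the non-negative values involved); Python's built-in `round` is avoided because it rounds halves to the nearest even integer.

```python
def quantize(counts, max_freq, quant_rate=0.01):
    """Return (quant, quantized_counts) following the rule described above."""
    if max_freq <= 1:
        quant = 1
    else:
        # Round half up (assumption), then enforce the minimum step of 2.
        quant = max(2, int(max_freq * quant_rate + 0.5))
    quantized = {}
    for token, freq in counts.items():
        # Round the frequency down to the nearest multiple of quant.
        rounded = (freq // quant) * quant
        # Tokens occurring fewer than quant times round down to 0 and are dropped.
        if rounded >= quant:
            quantized[token] = rounded
    return quant, quantized
```

With the default `quant_rate`, counts of `{"a": 5, "b": 3, "c": 1}` and a maximum frequency of 5 give `quant = 2`, so `a` becomes 4, `b` becomes 2, and `c` is discarded.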
=== Hashing ===
The remaining tokens and their quantized frequencies are written to a string as a space-delimited sequence, in descending frequency order. The MD5 hash of this string is the signature.
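The hashing step can be sketched in Python as follows. Several details are assumptions that the description above does not fix: the function name `profile_signature` is hypothetical, ties between equal frequencies are broken alphabetically by token, the string is encoded as UTF-8, and the digest is returned as hexadecimal text. Signatures produced this way should not be expected to match those generated by Solr.

```python
import hashlib


def profile_signature(quantized):
    """Return an MD5 hex digest of a token/frequency profile."""
    # Descending frequency; alphabetical tie-break is an assumption.
    ordered = sorted(quantized.items(), key=lambda item: (-item[1], item[0]))
    # Space-delimited sequence of tokens and their frequencies.
    profile = " ".join(f"{token} {freq}" for token, freq in ordered)
    # UTF-8 encoding and hex output are assumptions of this sketch.
    return hashlib.md5(profile.encode("utf-8")).hexdigest()
```

Because the profile is sorted before hashing, two documents that yield the same quantized counts receive the same signature regardless of the order in which their tokens appeared.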
See also [[http://lucene.apache.org/solr/api-4_0_0-BETA/org/apache/solr/update/processor/TextProfileSignature.html|TextProfileSignature's javadoc]]