Posted to solr-commits@lucene.apache.org by Apache Wiki <wi...@apache.org> on 2012/10/30 08:16:53 UTC

[Solr Wiki] Update of "TextProfileSignature" by JoelNothman

The "TextProfileSignature" page has been changed by JoelNothman:
http://wiki.apache.org/solr/TextProfileSignature

Comment:
description of algorithm

New page:
TextProfileSignature calculates a fuzzy hash of textual fields for [[Deduplication]]. It is configured through a SignatureUpdateProcessorFactory definition, which accepts the following parameters:

|| Name || Type || Description || Default value ||
|| `minTokenLen` || int || The minimum token length to consider; shorter tokens are discarded || 2 ||
|| `quantRate` || float || Multiplied by the maximum token frequency to obtain the quantization step `quant` (see Count quantization) || 0.01 ||

The signature calculation proceeds as follows:

=== Tokenization and normalization ===

* Tokens are maximal runs of contiguous alphanumeric characters.
* Each token is normalized to lowercase.
* Tokens shorter than `minTokenLen` are discarded.

Tokens are then counted, tracking the frequency `maxFreq` of the most frequent token.
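
The tokenization and counting steps above can be sketched in Python as follows. This is an illustrative sketch of the description on this page, not the Solr implementation: the function name `token_counts` is hypothetical, and the regular expression is one assumed reading of "alphanumeric" (Unicode letters and digits, excluding the underscore).

```python
import re
from collections import Counter

def token_counts(text, min_token_len=2):
    """Count lowercased alphanumeric tokens of at least min_token_len characters."""
    # [^\W_]+ matches runs of letters and digits (\w without the underscore).
    tokens = (m.group(0).lower() for m in re.finditer(r"[^\W_]+", text))
    # Tokens shorter than min_token_len are discarded, as described above.
    return Counter(t for t in tokens if len(t) >= min_token_len)

counts = token_counts("The cat sat. The CAT ran; a dog sat!")
# maxFreq is the frequency of the most frequent token (0 for empty input).
max_freq = max(counts.values(), default=0)
```

Here "a" is dropped because it is shorter than the default `minTokenLen` of 2, and "The" and "CAT" are merged with their lowercase forms before counting.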

=== Count quantization ===

A value `quant` is calculated as follows, taking the first condition that applies:

|| || 1 || if `maxFreq` <= 1 ||
||`quant` := || 2 || if round(`maxFreq * quantRate`) < 2 ||
|| || round(`maxFreq * quantRate`) || otherwise ||

Each token frequency is then rounded down to the nearest multiple of `quant`, and any token occurring fewer than `quant` times is discarded.
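
A minimal Python sketch of the quantization rule described above. The names `quant_step` and `quantize` are hypothetical, and rounding half up is an assumption, since this page only says "round" (Python's built-in `round()` rounds halves to even, so it is avoided here).

```python
import math

def quant_step(max_freq, quant_rate=0.01):
    """Return the quantization step, checking the conditions in table order."""
    if max_freq <= 1:
        return 1
    # Assumption: round half up.
    rounded = math.floor(max_freq * quant_rate + 0.5)
    return 2 if rounded < 2 else rounded

def quantize(counts, quant):
    """Round each count down to a multiple of quant; drop tokens below quant."""
    result = {}
    for token, count in counts.items():
        rounded_down = (count // quant) * quant
        # A count below quant rounds down to 0, so the token is discarded.
        if rounded_down >= quant:
            result[token] = rounded_down
    return result

step = quant_step(50)  # round(50 * 0.01) is below 2, so the step is 2
profile = quantize({"the": 7, "cat": 4, "dog": 1}, step)
```

With the default `quantRate` of 0.01, the step stays at 2 for every document whose most frequent token occurs fewer than about 250 times, so in short texts the main effect is to drop tokens that occur only once and to round odd counts down to even ones.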

=== Hashing ===

The remaining tokens and their quantized frequencies are serialized as a space-delimited string, in descending frequency order. This string is then MD5-hashed to produce the signature.
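
The hashing step can be sketched in Python as below. The function names are hypothetical, and several details are assumptions that this page does not specify: ties in frequency are broken alphabetically, each token is followed by its frequency, and the string is encoded as UTF-8 before hashing. The resulting digests are therefore not guaranteed to match those produced by Solr.

```python
import hashlib

def profile_string(quantized):
    """Serialize token/frequency pairs in descending frequency order."""
    # Assumption: ties are broken alphabetically to keep the output deterministic.
    ordered = sorted(quantized.items(), key=lambda item: (-item[1], item[0]))
    return " ".join(f"{token} {freq}" for token, freq in ordered)

def profile_signature(quantized):
    """Return the MD5 hex digest of the serialized profile."""
    # Assumption: UTF-8 encoding of the profile string.
    return hashlib.md5(profile_string(quantized).encode("utf-8")).hexdigest()

signature = profile_signature({"cat": 4, "the": 6})
```

Because the signature depends only on the quantized profile, two documents that differ in rare tokens or by small changes in counts can serialize to the same string and so receive the same signature, which is what makes the hash fuzzy.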

See also [[http://lucene.apache.org/solr/api-4_0_0-BETA/org/apache/solr/update/processor/TextProfileSignature.html|TextProfileSignature's javadoc]]