You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@spamassassin.apache.org by bu...@bugzilla.spamassassin.org on 2004/01/28 18:56:17 UTC
[Bug 2978] New: patch to add support for fuzzy matching
http://bugzilla.spamassassin.org/show_bug.cgi?id=2978
Summary: patch to add support for fuzzy matching
Product: Spamassassin
Version: 2.63
Platform: All
OS/Version: All
Status: NEW
Severity: enhancement
Priority: P5
Component: Rules (Eval Tests)
AssignedTo: spamassassin-dev@incubator.apache.org
ReportedBy: sgunderson@bigfoot.com
(For the reference, this is Debian bug <a
href="http://bugs.debian.org/230097">#230097</a>.
The included patch includes support for fuzzy matching of words in the
subject, so for instance `vi1qgra' is matched against 'viagra' (within
limits -- it won't take `\/1g|^a' for instance, that's too far away).
The algorithm itself is standard edit-distance (as used in spelling
checkers, for instance), with the following modifications:
- All special characters (ie. non-alphanumeric) are changed into
spaces and compressed before checking.
- Adding a character (ie. "vigra" -> "viagra") has cost 1.
- Replacing a charachter (ie. "vi4gra" -> "viagra") also has cost 1.
- Deleting a character (ie. "viaagra" -> "viagra") also has cost 1,
_except_ that deleting spaces (ie. "v i a g r a" -> "viagra", or even
"v!!i!!a!!g!!r!!a" -> "viagra") has cost 0. Inserting spaces still
has cost 1, so stuff like "having a" isn't matched too easily into
"viagra". :-)
- The smallest number of such changes required to get a substring match
for the word (ie. " viagra ") is the cost; if this is under some
number (usually 2 for shorter words or 3 for slightly longer), it's a
match. 0 means the word would have matched exactly in a substring
match (that is, after removing the spaces).
Ordinary edit-distance runs in O(nm) time where n and m are the lengths
of the searched text (the subject) and the keyword (ie. "viagra" or
whatever) respectively. In order to get the substring match correctly,
I had to increase that to O(n^2*m), but since m is fairly small and
subjects seldom are over 100-120 characters, this shouldn't be a
problem, and the code seems more than efficient enough in real-world
tests.
The scoring and keywords used in the .cf file could (should) of course
be modified, they are set fairly arbitrarily, and are primarily meant
for demonstration. :-)
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.