You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@spamassassin.apache.org by bu...@bugzilla.spamassassin.org on 2004/01/28 18:56:17 UTC

[Bug 2978] New: patch to add support for fuzzy matching

http://bugzilla.spamassassin.org/show_bug.cgi?id=2978

           Summary: patch to add support for fuzzy matching
           Product: Spamassassin
           Version: 2.63
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: enhancement
          Priority: P5
         Component: Rules (Eval Tests)
        AssignedTo: spamassassin-dev@incubator.apache.org
        ReportedBy: sgunderson@bigfoot.com


(For the reference, this is Debian bug <a
href="http://bugs.debian.org/230097">#230097</a>.

The included patch includes support for fuzzy matching of words in the
subject, so for instance `vi1qgra' is matched against 'viagra' (within
limits -- it won't take `\/1g|^a' for instance, that's too far away).
The algorithm itself is standard edit-distance (as used in spelling
checkers, for instance), with the following modifications:

 - All special characters (ie. non-alphanumeric) are changed into
   spaces and compressed before checking.
 - Adding a character (ie. "vigra" -> "viagra") has cost 1.
 - Replacing a charachter (ie. "vi4gra" -> "viagra") also has cost 1.
 - Deleting a character (ie. "viaagra" -> "viagra") also has cost 1,
   _except_ that deleting spaces (ie. "v i a g r a" -> "viagra", or even
   "v!!i!!a!!g!!r!!a" -> "viagra") has cost 0. Inserting spaces still
   has cost 1, so stuff like "having a" isn't matched too easily into
   "viagra". :-)
 - The smallest number of such changes required to get a substring match
   for the word (ie. " viagra ") is the cost; if this is under some
   number (usually 2 for shorter words or 3 for slightly longer), it's a
   match. 0 means the word would have matched exactly in a substring
   match (that is, after removing the spaces).
   
Ordinary edit-distance runs in O(nm) time where n and m are the lengths
of the searched text (the subject) and the keyword (ie. "viagra" or
whatever) respectively. In order to get the substring match correctly,
I had to increase that to O(n^2*m), but since m is fairly small and
subjects seldom are over 100-120 characters, this shouldn't be a
problem, and the code seems more than efficient enough in real-world
tests.

The scoring and keywords used in the .cf file could (should) of course
be modified, they are set fairly arbitrarily, and are primarily meant
for demonstration. :-)



------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.