Posted to commits@spamassassin.apache.org by sp...@incubator.apache.org on 2004/08/01 00:13:01 UTC

[SpamAssassin Wiki] Updated: CeasNotesJustin

   Date: 2004-07-31T15:13:00
   Editor: JustinMason <jm...@jmason.org>
   Wiki: SpamAssassin Wiki
   Page: CeasNotesJustin
   URL: http://wiki.apache.org/spamassassin/CeasNotesJustin

   no comment

Change Log:

------------------------------------------------------------------------------
@@ -296,4 +296,37 @@
   * q from John Levine: TurnTide does exactly this technique by narrowing the TCP window on the spammer's connections.
   * q: why not just use delayed ACKs?   a: because it's not quite as effective as the other techniques
 
+AOL hashing:
+
+  * I-Match: large corpus; lexicon generation
+  * intersection of document and lexicon gives signature
+  * traditional I-Match lexicon generation: reject very frequent terms and hapaxes
+  * use "Mutual Information" as a measurement of fitness to avoid overlapping rules
+  * use multiple lexicons so that spammers' randomization has less effect on the signature
+  * generate multiple lexicons, by removing random entries from an original lexicon
+  * also: distributional word clustering (Information Bottleneck) for lexicon selection (cluster terms with similar class distributions P(spam|term))
+  * q: "'cluster' selection" -- is that reports from live users?  yep
+  * q: "FP rate?"   a: very very low
+
+Distributed, collaborative spam filtering:
+
+  * TCD, yay
+  * definition: "spam is email that the recipient is not interested in receiving".  we disagree, of course ;)
+  * P2P approach
+
+Reputation network analysis for mail filtering:
+
+  * 75% of semweb data is FOAF files
+  * using web of trust
+  * a bit like http://web-o-trust.org/ , but not yet workable with email addrs since there's no spoofing protection
+
+On attacking statistical spam filters:
+
+  * spammers wanted to evade bayes
+  * tokenization/obfuscation tricks turn out to be good spam signs
+  * should not have used SpamArchive spam, due to its lack of headers, in my opinion; headers improve spam recognition greatly
+  * pretty similar to http://www.cs.dal.ca/research/techreports/2004/CS-2004-06.pdf ;)
+
+
+
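A rough Python sketch of the I-Match notes under "AOL hashing" above, not the method as presented in the talk. It builds a lexicon by rejecting rare and very frequent terms, hashes the intersection of a document with the lexicon to get a signature, and derives extra lexicons by randomly removing entries. Everything named here is an assumption made for illustration: the whitespace tokenizer, the use of document frequency as a stand-in for hapax detection, the `max_doc_fraction` and `drop_fraction` thresholds, SHA-1 as the hash, and the rule that two messages match when their signatures agree under at least one lexicon.

```python
import hashlib
import random
from collections import Counter


def tokenize(text):
    # Assumption: naive lowercase whitespace tokenizer, for illustration only.
    return text.lower().split()


def build_lexicon(corpus, max_doc_fraction=0.5):
    """Keep terms that are neither too rare nor too common in the corpus."""
    doc_freq = Counter()
    for doc in corpus:
        doc_freq.update(set(tokenize(doc)))
    limit = max_doc_fraction * len(corpus)
    # Drop terms seen in only one document (a stand-in for hapaxes) and
    # terms seen in more than max_doc_fraction of the documents.
    return {term for term, n in doc_freq.items() if 1 < n <= limit}


def signature(text, lexicon):
    """Hash the sorted intersection of the document's terms and the lexicon."""
    terms = sorted(set(tokenize(text)) & lexicon)
    if not terms:
        return None  # nothing in common with the lexicon: no signature
    return hashlib.sha1(" ".join(terms).encode("utf-8")).hexdigest()


def perturbed_lexicons(lexicon, copies=3, drop_fraction=0.3, seed=0):
    """Return the original lexicon plus copies with random entries removed."""
    rng = random.Random(seed)
    ordered = sorted(lexicon)
    keep = min(len(ordered), max(1, int(len(ordered) * (1 - drop_fraction))))
    extra = [set(rng.sample(ordered, keep)) for _ in range(copies)] if ordered else []
    return [set(lexicon)] + extra


def near_duplicates(a, b, lexicons):
    """True if the two texts share a signature under at least one lexicon."""
    for lexicon in lexicons:
        sig_a = signature(a, lexicon)
        if sig_a is not None and sig_a == signature(b, lexicon):
            return True
    return False
```

Words outside the lexicon never reach the hash, so random filler that misses the lexicon leaves the signature unchanged. Filler that does hit a lexicon word changes the signature for the full lexicon, but may leave it unchanged under a reduced lexicon that happens to omit that word, which is the motivation for keeping several.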