Posted to server-dev@james.apache.org by Yu Fu <yu...@umbc.edu> on 2011/04/08 00:59:39 UTC

JAMES-1216 [gsoc2011] Proposal Update for Design and implement machine learning filters and categorization for mail

Hi Everyone,
This is Vicki. We are getting close to the deadline for proposal submission.
Thanks for the valuable comments from Eric and Robert. I have some updates:


   - Implement tokenization and stemming pre-processing
   - Implement the SVMlight algorithm to strengthen the Anti-Spam filter.
   - Consider an "out of the box" design for the Priority Mail filter.

I have attached the proposal below:

Proposal Title: Design and implement machine learning filters and
categorization for anti spam in the mail system
Student Name: Yu Fu
Student E-mail: yufu1@umbc.edu
Organization/Project: The Apache Software Foundation
Assigned Mentor: Eric Charles, Robert Burrell Donkin
Proposal Abstract: Use the advanced machine learning methods k-nearest
neighbors and SVM to implement an intelligent mail filtering feature and to
strengthen anti-spam capability beyond the current Bayesian Analysis method.
The filter will be implemented and tested with pre-processing, feature
selection, and ML algorithms.
Detailed Description:
1. Understand James Anti-Spam Features
The current James Anti-Spam support has three parts, which I marked in red
in the figure below. The figure also shows how the basic concepts relate to
each other.

   - The first part, fastfail in the SMTPServer, rejects a message at the
   SMTP level if a configured number of hits is reached. It operates in the
   background of a machine, listening on a specific TCP port. At this level,
   the server filters out non-existent users, applies a DSN filter, rejects
   domains with invalid MX records, and ignores duplicated recipients. Also,
   IP addresses that send emails to the configured recipients will get
   blacklisted for the configured time. The port and the hit score can be
   configured in smtpserver.xml.
   - SpamAssassin in the Mailet classifies messages as spam or not using an
   internal (configurable) score threshold. Usually a message will only be
   considered spam if it matches multiple criteria; matching just a single
   test will not usually be enough to reach the threshold.
   - BayesianAnalysis in the Mailet classifies messages as spam or not based
   on Bayesian probability. It relies on training data from the users'
   judgment: users manually judge a message as spam and send it to
   spam@thisdomain.com, or judge it as non-spam and send it to
   not.spam@thisdomain.com. BayesianAnalysisFeeder learns from this training
   dataset and builds predictive models on Bayesian probability. A table in
   the database maintains the corpus frequency for keywords, and every 10
   minutes a thread in BayesianAnalysis checks and updates the table. Also,
   the correct approach is to send the original spam or non-spam as an
   attachment to another message sent to the feeder, in order to avoid bias
   from the current sender's email header.


2. Motivation
The Naive Bayes approach, which is based on Bayesian probability, is the
most popular method. But for some overly popular phrases or keywords, Naive
Bayes does not work well. Consider the phrase "Click Here to Unsubscribe",
which occurs 10 times as often in spam as in good mail: P(click|spam) is 10
times higher and P(unsubscribe|spam) is 10 times higher, so multiplied
together under the independence assumption they give a factor of 100. The
bias of these hot keywords will heavily hurt the accuracy of the filter.
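The multiplication effect above can be made concrete with a small sketch
(the class and method names are my own, purely illustrative):

```java
// Illustration of the Naive Bayes bias described above: under the
// independence assumption, per-word spam/ham likelihood ratios multiply,
// so a few correlated "hot" words dominate the final score.
public class NaiveBayesBias {

    // Combined likelihood ratio P(words|spam) / P(words|ham),
    // assuming the words are conditionally independent.
    public static double combinedRatio(double[] perWordRatios) {
        double ratio = 1.0;
        for (double r : perWordRatios) {
            ratio *= r; // Naive Bayes multiplies the individual ratios
        }
        return ratio;
    }
}
```

With "click" and "unsubscribe" each 10 times more likely in spam,
`combinedRatio(new double[]{10.0, 10.0})` already yields 100, even though
the two words almost always occur together and carry little independent
evidence.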
I want to implement k-nearest neighbor and SVMlight, which show much better
experimental results [1] than our current approach. The measurements are
filtering accuracy and the efficiency of building the model on training
data and testing it on test data.
This project will be the starting point for me to contribute to the James
open source community. Beyond Anti-Spam and for the long term, I plan to
design an intelligent way to filter mail directly with a "Priority" rank or
mark, as already implemented by Google in the Gmail system. Intelligent
email filtering is a practical issue more than a theoretical technique. I
can contribute strong and scalable analytical reasoning capability and
foresighted design to this community. My aim here is to integrate machine
learning methods into the James mail system efficiently and easily.

3. Project Scope
The project will be based on the James Server 3.0 framework, and currently
it will restrict the categories to spam/not spam.

   - Implement tokenization and stemming pre-processing
   - Implement "TF-IDF" and "binary vector" feature selection.
   - Implement the k-nearest neighbor algorithm to strengthen the new
   Anti-Spam filter.
   - Implement the SVMlight algorithm to strengthen the Anti-Spam filter.

4. Approach
My idea is to use the k-nearest neighbor algorithm to classify mail as spam
or non-spam. By computing distances between text feature vectors, KNN shows
better experimental performance in terms of accuracy and efficiency [1].

Tokenization and stemming pre-processing
This part will be implemented integrated with the Mime4J parser.
Tokenization: separates text into individual words. For example, the
sentence "I am working on the proposal" will turn into "I", "am",
"working", "on", "the", "proposal".
Stop word removal: removes common words that are usually not useful for
text classification, for example "a", "the", "I", "he", "she", "is",
"are", etc.
Stemming: normalizes words derived from the same root. For example,
"attending" turns into "attend" and "teacher" turns into "teach".
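The three pre-processing steps could be sketched as follows. This is a
minimal illustration, not James code: a real implementation would hook into
the Mime4J parser and use a proper stemmer (e.g. the Porter stemmer) rather
than the crude suffix stripping shown here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical pre-processing sketch: tokenization, stop-word removal,
// and a deliberately naive stand-in for stemming.
public class PreProcessor {

    private static final Set<String> STOP_WORDS = new HashSet<>(
            Arrays.asList("a", "the", "i", "he", "she", "is", "are", "am", "on"));

    // Tokenization: split text into individual lower-case words.
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("[^a-z']+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // Stop-word removal plus naive stemming ("working" -> "work").
    public static List<String> clean(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            if (STOP_WORDS.contains(t)) continue;       // drop stop words
            if (t.endsWith("ing") && t.length() > 5) {  // crude suffix strip
                t = t.substring(0, t.length() - 3);
            }
            out.add(t);
        }
        return out;
    }
}
```

On "I am working on the proposal" this yields the tokens "work" and
"proposal" after cleaning, which is the kind of reduced term stream the
feature-selection stage would consume.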

Feature Selection
Both "TF-IDF" (case insensitive) and "binary vector" (case insensitive)
feature selection will be implemented.
TF-IDF: The tf-idf weight (term frequency-inverse document frequency) is a
statistical measure used to evaluate how important a word is to a document
in a collection or corpus <http://en.wikipedia.org/wiki/Text_corpus>. The
importance increases proportionally to the number of times a word appears in
the document but is offset by the frequency of the word in the corpus.
Variations of the tf-idf weighting scheme are often used by search engines
as a central tool in scoring and ranking a document's relevance given a user
query, so it can mitigate the bias of the Bayesian approach. The TF-IDF
parameters will be tuned by experiment and can also be configured manually
by users.
Weighted TF-IDF: based on TF-IDF, we can assign different weights depending
on a word's position in the mail structure. My assumption is that words in
the message header are more important than those in the body.
Binary vector: maintains, for each training example, whether each feature
(term) exists or not (1 or 0). The main variable here is the size of the
vector, which we will set with the available memory size in mind.
Input variables: vector size, ratio for TF-IDF, and weight for word
position.
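A minimal sketch of the TF-IDF computation, using the standard definitions
tf(t,d) = count(t,d) / |d| and idf(t) = log(N / df(t)); the class and
method names are illustrative, not part of any James API:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical TF-IDF sketch over pre-processed token lists.
public class TfIdf {

    // Returns the tf-idf weight of every term in `doc`, relative to `corpus`.
    public static Map<String, Double> weights(List<String> doc,
                                              List<List<String>> corpus) {
        // Term frequency counts within the document.
        Map<String, Integer> tf = new HashMap<>();
        for (String t : doc) tf.merge(t, 1, Integer::sum);

        Map<String, Double> w = new HashMap<>();
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            int df = 0; // number of documents containing the term
            for (List<String> d : corpus) {
                if (d.contains(e.getKey())) df++;
            }
            double idf = Math.log((double) corpus.size() / df);
            w.put(e.getKey(), (e.getValue() / (double) doc.size()) * idf);
        }
        return w;
    }
}
```

The "weighted TF-IDF" variant described above would simply multiply each
weight by a position factor (for example, a larger factor for header terms
than for body terms) before building the vector.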

Feature extraction
I will use mutual information to reduce the dimensionality. The mutual
information of each word with the class variable is used to select
important features: it captures the amount of information gained in
classifying documents by knowing the presence or absence of the word [3].
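Concretely, the mutual information of a term with the class can be computed
from a 2x2 contingency table of document counts. A sketch (my own helper,
not existing James code):

```java
// Mutual information I(term; class) in bits, from a 2x2 contingency table:
// n[i][j] = number of documents where i = term present (1) / absent (0)
// and j = spam (1) / non-spam (0). Terms with the highest MI are kept.
public class MutualInfo {

    public static double mi(long[][] n) {
        double total = n[0][0] + n[0][1] + n[1][0] + n[1][1];
        double sum = 0.0;
        for (int i = 0; i < 2; i++) {
            for (int j = 0; j < 2; j++) {
                if (n[i][j] == 0) continue; // treat 0 * log(0) as 0
                double pij = n[i][j] / total;               // joint
                double pi = (n[i][0] + n[i][1]) / total;    // term marginal
                double pj = (n[0][j] + n[1][j]) / total;    // class marginal
                sum += pij * (Math.log(pij / (pi * pj)) / Math.log(2));
            }
        }
        return sum;
    }
}
```

A term whose presence is independent of the class scores 0 and can be
dropped; a term perfectly correlated with spam scores 1 bit and is kept.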

SVMlight
Support Vector Machine is a powerful machine learning technique that is able
to work with high-dimensional and noisy data. SVM uses sophisticated kernel
functions to map data points into a higher dimension and classifies the data
with a hyperplane in that space, which corresponds to a complex decision
surface in the original space. Determining the right parameters of such
functions is not only computationally expensive; the resulting models are
also susceptible to overfitting when the number of training examples is
small, due to their large VC dimensions. I will use an optimized SVM
approach: SVMlight [5]. This method improves on traditional SVM with a fast
optimization algorithm, a "shrinking" heuristic, and caching of kernel
evaluations.
The SVM method implements:
1) Select the top terms (or a certain number of terms) by document
frequency.
2) Create a "TF-IDF" vector for each training example.
3) Use SVM to train and obtain a predictive model. I will employ the
SVMlight code implemented in C by Thorsten Joachims [5].
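Bridging step 2 and step 3 requires serializing each TF-IDF vector into the
sparse "label index:value" line format that SVMlight's svm_learn reads. A
hypothetical helper for that (the class name and method are my own):

```java
import java.util.Map;
import java.util.TreeMap;

// Writes one training example in SVMlight's sparse input format:
// "<label> <index>:<value> <index>:<value> ..." with ascending indices.
public class SvmLightFormat {

    // label: +1 for spam, -1 for non-spam;
    // features: 1-based feature index -> tf-idf weight (zeros omitted).
    public static String line(int label, Map<Integer, Double> features) {
        StringBuilder sb = new StringBuilder(label > 0 ? "+1" : "-1");
        // TreeMap guarantees the ascending index order SVMlight requires.
        for (Map.Entry<Integer, Double> e : new TreeMap<>(features).entrySet()) {
            sb.append(' ').append(e.getKey()).append(':').append(e.getValue());
        }
        return sb.toString();
    }
}
```

One such line per training mail, written to a file, is what svm_learn
consumes to produce the model that svm_classify later applies to unseen
mail.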

K-Nearest Neighbor (KNN)
This method still needs to build a predictive model on training data, which
can share the training data of BayesianAnalysis. The k-nearest neighbors
algorithm (KNN) is a method for classifying
<http://en.wikipedia.org/wiki/Statistical_classification> objects based on
the closest training examples in the feature space. It is among the
simplest of all machine learning algorithms: an object is classified by a
majority vote of its neighbors, the object being assigned to the class most
common among its k nearest neighbors (k is a positive integer, typically
small). KNN will use the same spam/non-spam training data provided manually
by users, but the KNN model can be trained efficiently.
The complexity of computing the model depends on two variables: k and the
number of training vectors [4]. In our case k can stay small, since we only
need to separate spam from non-spam. Building the model will be O(N*N),
since it computes the distance between each pair of training examples,
where N is the number of training vectors.
KNN method implements:

   - Method 1: Create a "binary vector" that records whether each feature
   (term) exists (1) or not (0) for each training example. Create a binary
   vector for each unclassified mail as well, then use cosine distance to
   find the closest training examples and determine which class the
   unclassified mail belongs to: compute the distance from the target to the
   sampled training points, order the samples by distance, choose a
   heuristically optimal number of nearest neighbors based on
   RMSE <http://en.wikipedia.org/wiki/RMSE> via cross-validation, and
   calculate an inverse-distance-weighted vote among the k nearest
   neighbors.
   - Method 2: Use "TF-IDF" vectors instead of binary vectors in Method 1.
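The core of both methods could be sketched as below. This is a minimal
illustration with unweighted majority voting and names of my own choosing;
the vectors could be either the binary vectors of Method 1 or the TF-IDF
vectors of Method 2.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical KNN sketch: cosine similarity over feature vectors,
// then a majority vote among the k most similar training examples.
public class CosineKnn {

    public static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    // labels.get(i) is true when train.get(i) was judged spam by a user.
    public static boolean isSpam(double[] mail, List<double[]> train,
                                 List<Boolean> labels, int k) {
        List<Integer> idx = new ArrayList<>();
        for (int i = 0; i < train.size(); i++) idx.add(i);
        // Sort training examples by decreasing similarity to the mail.
        idx.sort(Comparator.comparingDouble(i -> -cosine(mail, train.get(i))));
        int spamVotes = 0;
        int voters = Math.min(k, idx.size());
        for (int i = 0; i < voters; i++) {
            if (labels.get(idx.get(i))) spamVotes++;
        }
        return spamVotes * 2 > voters; // strict majority of the k neighbors
    }
}
```

The inverse-distance weighting mentioned in Method 1 would replace the
plain vote count with a sum of 1/distance terms per class; the sort and
neighbor selection stay the same.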

Schedule
1 week: understand the Mailets API
1 week: implement pre-processing
1 week: implement feature selection on the training dataset
2 weeks: implement the KNN method (1 week implementation, 1 week testing)
2 weeks: implement the SVMlight method (1 week implementation, 1 week
testing)
2 weeks: tune the predictive model with different parameters
1 week: finish the documentation and guide for usage and tuning

My Bio
Yu Fu, PhD student at the University of Maryland, Baltimore County
I have more than two years of research experience in data mining. I am
currently designing a new secure data mining algorithm for encrypted
databases, funded by NSF. I have implemented regression trees, KNN, and
other data mining algorithms on top of the Weka API to support my
experiments. I have three publications on privacy preservation and security
in data mining.
[1] Anti-Spam Filter Based on Naïve Bayes, SVM, and KNN model.
irlab.csie.ntu.edu.tw/~chean/perWeb/projects/AI/AI_final.pdf
[2] An Empirical Performance Comparison of Machine Learning Methods for Spam
E-Mail Categorization.
http://www.computer.org/portal/web/csdl/doi/10.1109/ICHIS.2004.21
[3] Text Categorization Using Weight Adjusted k-Nearest Neighbor
Classification
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.109.1259&rep=rep1&type=pdf
[4] Introduction to Information Retrieval, By Christopher D. Manning,
Prabhakar Raghavan & Hinrich Schütze
http://nlp.stanford.edu/IR-book/html/htmledition/time-complexity-and-optimality-of-knn-1.html#tab:knncomp
[5] SVMlight: http://svmlight.joachims.org


Yu Fu

yufu1@umbc.edu
443-388-6654