Posted to server-dev@james.apache.org by Yu Fu <yu...@umbc.edu> on 2011/04/08 15:55:00 UTC

JAMES-1216 [gsoc2011] Proposal Update again for Design and implement machine learning filters and categorization for mail

Thanks Eric and Robert,
I have updated my proposal again as follows:
1. Feature selection: I will also consider other approaches here.
2. Added a link to JAMES-1216.
3. Fixed spelling and writing errors.

attached:

Proposal Title: Design and implement machine learning filters and
categorization for anti spam functions in the mail system
Student Name: Yu Fu
Student E-mail: yufu1@umbc.edu
Organization/Project: The Apache Software Foundation
Assigned Mentor: Eric Charles, Robert Burrell Donkin
Proposal Abstract: Use advanced machine learning methods, K-nearest
neighbors and SVM, to implement intelligent mail filtering, adding
and testing pre-processing, feature selection, and ML algorithms in
order to strengthen anti-spam capability beyond the current Bayesian
analysis method.
Detailed Description:
1. Understand James Anti-Spam Features
The current James anti-spam system has three parts, which I have
marked in red in the figure below. The figure also shows how the
basic concepts relate to each other.
•	The first part, fastfail in the SMTPServer, rejects a message at
the SMTP level once a configured hit score is reached. The server
runs in the background, listening on a specific TCP port. At this
level it filters out non-existent users, applies a DSN filter,
rejects domains with invalid MX records, and ignores duplicated
recipients. IP addresses that send email to the configured recipients
are blacklisted for a configured time. The port and the hit score can
be configured in smtpserver.xml.
•	The SpamAssassin mailet classifies messages as spam or not spam
against an internal (configurable) score threshold. Usually a message
will only be considered spam if it matches multiple criteria;
matching just a single test will not usually be enough to reach the
threshold.
•	The BayesianAnalysis mailet uses Bayesian probability to classify
mail as spam or not spam. It relies on training data coming from the
users’ judgment: users manually judge a message as spam and send it
to spam@thisdomain.com or, conversely, if it is not spam, to
not.spam@thisdomain.com. The BayesianAnalysisFeeder learns from this
training data set and builds predictive models based on Bayesian
probability. A table in the database maintains the corpus frequency
of each keyword, and every 10 minutes a thread in BayesianAnalysis
checks and updates that table. Also, the correct approach is to send
the original spam or non-spam as an attachment to another message
addressed to the feeder, in order to avoid bias from the current
sender’s email headers.

2. Motivation
The Naive Bayes approach, which is based on Bayesian probability, is
the most popular method, but it does not work well for overly popular
phrases or keywords. Suppose the phrase “Click Here to Unsubscribe”
occurs 10 times as often in spam as in good mail. Then P(click|spam),
P(here|spam), and P(unsubscribe|spam) are each 10 times higher, and
because Naive Bayes multiplies them as if they were independent
evidence, the single phrase contributes a factor of 1000. The bias
from these hot keywords heavily hurts the accuracy of the filter.
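To make this concrete, here is a toy sketch (the numbers are
hypothetical, not measured on any corpus) of how per-word ratios
multiply under the independence assumption:

// NaiveBayesBias.java - toy illustration of how correlated words
// inflate a Naive Bayes spam score. The ratios are hypothetical.
public class NaiveBayesBias {
    public static void main(String[] args) {
        // Suppose each word is seen 10x more often in spam than in ham.
        double[] likelihoodRatios = {10.0, 10.0, 10.0}; // "click", "here", "unsubscribe"

        // Naive Bayes treats the words as independent and multiplies.
        double combined = 1.0;
        for (double r : likelihoodRatios) {
            combined *= r;
        }
        // Prints 1000.0: one phrase is scored as if it were three
        // independent pieces of evidence.
        System.out.println("Combined spam/ham ratio: " + combined);
    }
}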
I want to implement K-nearest neighbor and a light SVM, which show
much better experimental results [1] than our current approach. The
measurements are the accuracy of filtering and the efficiency of
building the model on the training data and evaluating it on the
test data.
This project will be my starting point for contributing to the James
open source community. Beyond anti-spam functions, in the longer term
I plan to design intelligent ways to filter mail directly with a
“Priority” rank or mark, as Google has already implemented in Gmail.
Intelligent email filtering is a practical issue more than a
theoretical technique. I can contribute strong, scalable analytical
reasoning and forward-looking design to this community. My aim here
is to bring machine learning methods into the James mail system
efficiently and easily.
3. Project Scope
The project will be based on the James Server 3.0 framework, and for
now it will restrict the categories to spam/not spam.
•	Implement tokenization and stemming pre-processing.
•	Implement “TF-IDF” and “binary vector” feature selection.
•	Implement the K-nearest neighbor algorithm to strengthen the new
anti-spam filter.
•	Implement a light SVM algorithm to strengthen the anti-spam filter.
4. Approach
My idea is to use the K-nearest neighbor algorithm to classify mail
as spam or non-spam. By computing distances between texts in the
feature space, KNN shows better experimental accuracy and efficiency
[1].

Tokenization and stemming pre-processing
This part will be implemented in integration with the Mime4J parser.
Tokenization: separate text into individual words. For example, the
sentence “I’m working on the proposal” becomes “I’m”, “working”,
“on”, “the”, “proposal”.
Stop word removal: remove common words that are usually not useful
for text classification, such as “a”, “the”, “I”, “he”, “she”, “is”,
“are”, etc.
Stemming: normalize words derived from the same root. For example,
“attending” becomes “attend” and “teacher” becomes “teach”.
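A minimal sketch of this pipeline follows; the suffix-stripping
stemmer is only a stand-in, and the real implementation would plug in
a proper stemmer (for example a Porter stemmer):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy pre-processing pipeline: tokenize, drop stop words, stem.
public class PreProcessor {
    private static final Set<String> STOP_WORDS = new HashSet<String>(
            Arrays.asList("a", "the", "i", "he", "she", "is", "are", "on", "to"));

    public static List<String> process(String text) {
        List<String> terms = new ArrayList<String>();
        // Tokenization: lower-case, split on anything that is not a letter.
        for (String token : text.toLowerCase().split("[^a-z']+")) {
            if (token.isEmpty() || STOP_WORDS.contains(token)) {
                continue; // stop word removal
            }
            terms.add(stem(token));
        }
        return terms;
    }

    // Very naive stemming: strip a few common suffixes.
    private static String stem(String w) {
        if (w.endsWith("ing") && w.length() > 5) return w.substring(0, w.length() - 3);
        if (w.endsWith("er") && w.length() > 4) return w.substring(0, w.length() - 2);
        if (w.endsWith("s") && w.length() > 3) return w.substring(0, w.length() - 1);
        return w;
    }

    public static void main(String[] args) {
        // Prints [i'm, work, proposal]
        System.out.println(process("I'm working on the proposal"));
    }
}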

Feature Selection
Both “TF-IDF” (case insensitive) and “binary vector” (case
insensitive) feature selection will be implemented.
TF-IDF: The tf–idf weight (term frequency–inverse document frequency)
is a statistical measure of how important a word is to a document in
a collection or corpus. The importance increases proportionally with
the number of times the word appears in the document but is offset by
the frequency of the word in the corpus. Variations of the tf–idf
weighting scheme are often used by search engines as a central tool
in scoring and ranking a document’s relevance to a user query, so it
can avoid the bias seen in the Bayesian approach. The TF-IDF
parameters will be tuned experimentally and can also be configured
manually by users.
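A minimal sketch of the computation, using one common variant of the
weighting, tfidf(t, d) = tf(t, d) * log(N / df(t)), and assuming the
document itself belongs to the corpus (so df >= 1):

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of tf-idf weighting over a small in-memory corpus.
public class TfIdf {
    public static Map<String, Double> weights(List<String> doc,
                                              List<List<String>> corpus) {
        // Term frequency within this document.
        Map<String, Integer> tf = new HashMap<String, Integer>();
        for (String term : doc) {
            Integer c = tf.get(term);
            tf.put(term, c == null ? 1 : c + 1);
        }
        // Weight each term by its inverse document frequency.
        Map<String, Double> result = new HashMap<String, Double>();
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            int df = 0; // number of corpus documents containing the term
            for (List<String> d : corpus) {
                if (d.contains(e.getKey())) {
                    df++;
                }
            }
            double idf = Math.log((double) corpus.size() / df);
            result.put(e.getKey(), e.getValue() * idf);
        }
        return result;
    }
}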
Weighted TF-IDF: Based on TF-IDF, we can assign different weights
according to a word’s position in the mail structure. My working
assumption is that words in the message header are more important
than words in the body.
Binary vector: maintains, for each training example, whether each
feature (term) exists or not (1 or 0). The main variable here is the
size of the vector, which we will choose with the available memory
in mind (see the sketch below).
Input variables: vector size, TF-IDF ratio, and weights for word
positions.
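A minimal sketch of the binary representation, assuming the
vocabulary has been fixed in advance (for example, the top-N terms by
document frequency):

import java.util.List;

// Binary feature vector: 1 if the term occurs in the mail, 0 otherwise.
// The vocabulary size bounds the memory used per training example.
public class BinaryVectorizer {
    public static int[] vectorize(List<String> mailTerms, List<String> vocabulary) {
        int[] v = new int[vocabulary.size()];
        for (int i = 0; i < vocabulary.size(); i++) {
            v[i] = mailTerms.contains(vocabulary.get(i)) ? 1 : 0;
        }
        return v;
    }
}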
Feature extraction
Good ideas do not always scale. In my approach, good feature
construction will ensure efficiency in model building and accuracy in
model prediction. Several approaches will be tested for this project,
including mutual information [3], PCA [6], and Linear Discriminant
Analysis [7], which are used for dimensionality reduction before
classification. The mutual information of each word with the class
variable has been used to select important features: it captures the
amount of information gained in classifying documents by knowing the
presence or absence of the word. PCA is the simplest of the true
eigenvector-based multivariate analyses. LDA assumes each class is
Gaussian distributed with different means but the same covariance, so
it is probably a good idea to plot some of the data beforehand to
make sure that this assumption is reasonable.
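As a sketch, the mutual information score for ranking a word can be
computed from four counts taken from the training corpus (word
present/absent crossed with spam/ham); the method and count names
here are illustrative:

// Mutual information between a word's presence and the spam/ham label.
public class MutualInformation {
    // n11: spam docs containing the word; n10: ham docs containing it;
    // n01: spam docs missing it;          n00: ham docs missing it.
    public static double mi(double n11, double n10, double n01, double n00) {
        double n = n11 + n10 + n01 + n00;
        return term(n11, (n11 + n10) * (n11 + n01), n)
             + term(n10, (n11 + n10) * (n10 + n00), n)
             + term(n01, (n01 + n00) * (n11 + n01), n)
             + term(n00, (n01 + n00) * (n10 + n00), n);
    }

    // One summand of I(word; class); 0 * log 0 is treated as 0.
    private static double term(double joint, double marginalsProduct, double n) {
        if (joint == 0) {
            return 0.0;
        }
        return (joint / n) * Math.log(joint * n / marginalsProduct);
    }
}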

SVM
The Support Vector Machine is a powerful machine learning technique
that is able to work with high-dimensional and noisy data. SVMs use
kernel functions to map data points into a higher dimension and
classify them there, where the decision surface can be complex. I
will use LibSVM [5] for the implementation.
SVM method steps:
1)	Select the top terms, or a configured number of terms, by document
frequency, using cross-validation folds.
2)	Create a “TF-IDF” vector for each training example.
3)	Train an SVM on these vectors to obtain a predictive model.
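A rough sketch of step 3 with LibSVM’s Java bindings [5]; the linear
kernel and C = 1.0 are placeholder choices to be tuned, and the
vectors are assumed to come from the feature selection step:

import libsvm.*;

// Sketch of training and using a spam/ham classifier with LibSVM.
public class SvmTrainer {
    public static svm_model train(double[][] features, double[] labels) {
        svm_problem prob = new svm_problem();
        prob.l = features.length;          // number of training examples
        prob.y = labels;                   // +1 = spam, -1 = ham
        prob.x = new svm_node[prob.l][];
        for (int i = 0; i < prob.l; i++) {
            prob.x[i] = toNodes(features[i]);
        }

        svm_parameter param = new svm_parameter();
        param.svm_type = svm_parameter.C_SVC;
        param.kernel_type = svm_parameter.LINEAR;
        param.C = 1.0;                     // to be tuned by cross-validation
        param.eps = 0.001;
        param.cache_size = 100;

        return svm.svm_train(prob, param);
    }

    public static double classify(svm_model model, double[] features) {
        return svm.svm_predict(model, toNodes(features));
    }

    // LibSVM uses a sparse representation: only non-zero features,
    // with 1-based indices.
    private static svm_node[] toNodes(double[] dense) {
        java.util.List<svm_node> nodes = new java.util.ArrayList<svm_node>();
        for (int j = 0; j < dense.length; j++) {
            if (dense[j] != 0.0) {
                svm_node n = new svm_node();
                n.index = j + 1;
                n.value = dense[j];
                nodes.add(n);
            }
        }
        return nodes.toArray(new svm_node[nodes.size()]);
    }
}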

K-Nearest Neighbor (KNN)
This method still needs training data to feed the predictive model,
and it can share the training data gathered for the Bayesian
analysis. The k-nearest neighbors algorithm (kNN) classifies objects
based on the closest training examples in the feature space. It is
among the simplest of all machine learning algorithms: an object is
classified by a majority vote of its neighbors, being assigned to the
class most common amongst its k nearest neighbors (k is a positive
integer, typically small). KNN will use the same spam/non-spam
training data provided manually by users, but the KNN model can be
trained efficiently.
The complexity of computing the model depends on two variables, k and
the number of keywords [4]. In our case we will start with k = 2,
since we only need to separate two classes (spam/non-spam). Building
the model is O(N*N), since it computes the distance between each pair
of training examples; each distance computation is in turn linear in
the configured vector size.
KNN method steps:
•	Method 1: Create a “binary vector” that records, for each training
example, whether each feature (term) exists or not (1 or 0). Create a
binary vector for each unclassified mail as well, then use cosine
distance to find the closest training examples and decide which class
the unclassified mail belongs to: compute the distance from the
target to each training example, order the examples by distance,
choose a heuristically optimal k (starting at 2) nearest neighbors
based on the RMSE from cross-validation, and take an
inverse-distance-weighted vote over those k nearest neighbors. A
sketch follows below.
•	Method 2: The same as Method 1, but using “TF-IDF” vectors instead
of binary vectors.
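A sketch of the Method 1 vote, using a similarity-weighted vote over
the k nearest neighbors (k and the weighting are to be tuned by
cross-validation):

import java.util.Arrays;
import java.util.Comparator;

// Sketch of kNN classification with cosine similarity over binary
// or tf-idf vectors.
public class KnnClassifier {
    public static boolean isSpam(final double[] mail, final double[][] train,
                                 boolean[] spamLabel, int k) {
        // Rank all training examples by cosine similarity to the mail.
        Integer[] order = new Integer[train.length];
        for (int i = 0; i < train.length; i++) {
            order[i] = i;
        }
        Arrays.sort(order, new Comparator<Integer>() {
            public int compare(Integer a, Integer b) {
                return Double.compare(cosine(mail, train[b]), cosine(mail, train[a]));
            }
        });

        // Similarity-weighted vote among the k nearest neighbors.
        double spamScore = 0, hamScore = 0;
        for (int i = 0; i < k && i < train.length; i++) {
            double sim = cosine(mail, train[order[i]]);
            if (spamLabel[order[i]]) {
                spamScore += sim;
            } else {
                hamScore += sim;
            }
        }
        return spamScore > hamScore;
    }

    private static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return (na == 0 || nb == 0) ? 0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
    }
}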
Schedule
1 week: understand the Mailets API
1 week: implement pre-processing
1 week: implement feature selection on the training dataset
2 weeks: implement the KNN method (1 week implementation, 1 week testing)
2 weeks: implement the light SVM method (1 week implementation, 1 week testing)
2 weeks: tune the predictive model with different parameters
1 week: finish the documentation and the guide for usage and tuning

My Bio
Yu Fu, PhD student at the University of Maryland, Baltimore County.
I have more than two years of research experience in data mining. I
am currently designing a new secure data mining algorithm for
encrypted databases, funded by the NSF. I have implemented regression
trees, KNN, and other data mining algorithms on top of the Weka API
to support my experiments, and I have three publications on
privacy-preserving and secure data mining.

[1] Anti-Spam Filter Based on Naïve Bayes, SVM, and KNN model.
irlab.csie.ntu.edu.tw/~chean/perWeb/projects/AI/AI_final.pdf
[2] An Empirical Performance Comparison of Machine Learning Methods
for Spam E-Mail Categorization.
http://www.computer.org/portal/web/csdl/doi/10.1109/ICHIS.2004.21
[3] Text Categorization Using Weight Adjusted k-Nearest Neighbor
Classification.
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.109.1259&rep=rep1&type=pdf
[4] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze.
Introduction to Information Retrieval.
http://nlp.stanford.edu/IR-book/html/htmledition/time-complexity-and-optimality-of-knn-1.html#tab:knncomp
[5] LibSVM: http://www.csie.ntu.edu.tw/~cjlin/libsvm/
[6] Jolliffe, I. T. (1986). Principal Component Analysis.
Springer-Verlag. doi:10.1007/b98835. ISBN 978-0-387-95442-4.
[7] McLachlan, G. (2004). Discriminant Analysis and Statistical
Pattern Recognition. Wiley Interscience.





--
Yu Fu
yufu1@umbc.edu
443-388-6654

---------------------------------------------------------------------
To unsubscribe, e-mail: server-dev-unsubscribe@james.apache.org
For additional commands, e-mail: server-dev-help@james.apache.org


Re: JAMES-1216 [gsoc2011] Proposal Update again for Design and implement machine learning filters and categorization for mail

Posted by Robert Burrell Donkin <ro...@gmail.com>.
On Fri, Apr 8, 2011 at 3:24 PM, Eric Charles <er...@apache.org> wrote:
>
>
>
> Hi Vicki,
> So, will the end user still need to send to spam@.../notspam@... during the
> training session?

More advanced algorithms may require cross-validation to limit
over-fitting. Typically, it's better to extract the features and then
work with just the numbers. So, I'd be happy to have a Mime4J module
capable of extracting a configurable feature vector by parsing a mail
without worrying too much about wiring this in. I'd find running
against a couple of mailboxes much better than the whole
spam@.../notspam@ stuff.
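
Something like this rough, untested sketch (assuming Mime4J's
MimeTokenStream API) is the shape I have in mind:

import java.io.InputStream;
import java.util.HashMap;
import java.util.Map;
import java.util.Scanner;

import org.apache.james.mime4j.stream.EntityState;
import org.apache.james.mime4j.stream.MimeTokenStream;

// Walks a message and counts terms in its body parts, so the
// learners only ever see numbers, never the mail itself.
public class FeatureExtractor {
    public static Map<String, Integer> termCounts(InputStream mail) throws Exception {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        MimeTokenStream stream = new MimeTokenStream();
        stream.parse(mail);
        for (EntityState state = stream.getState();
             state != EntityState.T_END_OF_STREAM;
             state = stream.next()) {
            if (state == EntityState.T_BODY) {
                Scanner s = new Scanner(stream.getInputStream());
                while (s.hasNext()) {
                    String term = s.next().toLowerCase();
                    Integer c = counts.get(term);
                    counts.put(term, c == null ? 1 : c + 1);
                }
            }
        }
        return counts;
    }
}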

Robert



Re: JAMES-1216 [gsoc2011] Proposal Update again for Design and implement machine learning filters and categorization for mail

Posted by Eric Charles <er...@apache.org>.


Hi Vicki,
So, will the end user still need to send to spam@.../notspam@... during 
the training session?
Tks,
- Eric

On 8/04/2011 15:55, Yu Fu wrote:
> Thanks Eric and Robert,
> I have updated my proposal again as follows:
> [...]
