Posted to dev@community.apache.org by "Kevin A. McGrail (JIRA)" <ji...@apache.org> on 2018/02/05 04:39:00 UTC

[jira] [Updated] (COMDEV-260) GSOC 2018 SpamAssassin Bayes Token ID

     [ https://issues.apache.org/jira/browse/COMDEV-260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kevin A. McGrail updated COMDEV-260:
------------------------------------
    Description: 
From an idea by Diane F Skoll (used with permission):

We tokenize inbound messages and store the tokens on the server. In each message, we add links for doing training. When you click on a training link, the system trains the message based on the tokens stored on the server. In that way, you are training using exactly the tokens that the Bayes code saw.

For SA, the key point is a framework that stores the Bayesian tokens from an email before the email is delivered so that, later, a "this is spam"/"this is ham" mechanism can use that information without needing the entire email.

Adding a header that carries the message id used for the storage makes a train-as-spam/train-as-ham framework easier to build.
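
As a rough illustration of the header step, here is a minimal Python sketch that stamps a message with a storage id before delivery. It is not SpamAssassin code: the header name X-Bayes-Token-Id and the helpers stamp_message and read_storage_id are hypothetical, and using a random UUID as the id is an assumption, since the proposal does not say how the id is generated.

```python
import uuid
from email import message_from_string
from typing import Optional, Tuple

# Hypothetical header name; the proposal does not specify one.
TOKEN_ID_HEADER = "X-Bayes-Token-Id"


def stamp_message(raw: str) -> Tuple[str, str]:
    """Attach a fresh storage id to a message; return (id, stamped message)."""
    msg = message_from_string(raw)
    storage_id = uuid.uuid4().hex  # assumption: any unique id would do
    # Drop any copy already present so a sender cannot pre-set the id.
    del msg[TOKEN_ID_HEADER]
    msg[TOKEN_ID_HEADER] = storage_id
    return storage_id, msg.as_string()


def read_storage_id(raw: str) -> Optional[str]:
    """Return the storage id carried by a message, or None if it has none."""
    return message_from_string(raw)[TOKEN_ID_HEADER]
```

A training link or a learn command would then only need the value of that header, not the message itself.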

The issues you are pointing to have more to do with the implementation of the "this is spam"/"this is ham" mechanism.

Storing just the tokens takes less space and mitigates privacy and legal concerns.

sa-learn would then be extended to take the message id and learn the stored tokens as spam or ham, instead of being fed the entire message.
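
To make the flow concrete, here is a small Python sketch of storing tokens under an id at scan time and later learning from the id alone. It is only a sketch under stated assumptions: the TokenStore class, the toy regex tokenizer, and the truncated SHA-1 hashing of tokens are all illustrative choices, not the SpamAssassin Bayes implementation or an existing sa-learn option.

```python
import hashlib
import re
from collections import Counter
from typing import Dict, Set


class TokenStore:
    """Hypothetical in-memory store: tokens kept per id, counts kept per class."""

    def __init__(self) -> None:
        self.tokens_by_id: Dict[str, Set[str]] = {}
        self.spam_counts: Counter = Counter()
        self.ham_counts: Counter = Counter()

    def record(self, storage_id: str, body: str) -> None:
        """At scan time, keep only hashed tokens; the message text is not stored."""
        words = set(re.findall(r"[a-z0-9']+", body.lower()))  # toy tokenizer
        self.tokens_by_id[storage_id] = {
            hashlib.sha1(w.encode("utf-8")).hexdigest()[:10] for w in words
        }

    def learn(self, storage_id: str, is_spam: bool) -> int:
        """Train from the stored tokens only; raises KeyError for an unknown id."""
        # pop() makes training one-shot in this sketch, so the same id
        # cannot be counted twice.
        tokens = self.tokens_by_id.pop(storage_id)
        counts = self.spam_counts if is_spam else self.ham_counts
        counts.update(tokens)
        return len(tokens)


store = TokenStore()
store.record("abc123", "Buy cheap pills, buy now")
learned = store.learn("abc123", is_spam=True)
```

Because learn() works from exactly the tokens recorded at scan time, training uses what the classifier saw rather than a re-tokenized copy of the message. A real store would also need persistence and an expiry policy, which this sketch leaves out.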

Apache SpamAssassin is a mail filter to identify spam. It is an intelligent email filter which uses a diverse range of tests to identify unsolicited bulk email, more commonly known as Spam. These tests are applied to email headers and content to classify email using advanced statistical methods. 

In addition, SpamAssassin has a modular architecture that allows other technologies to be quickly wielded against spam and is designed for easy integration into virtually any email system. 

It is primarily written in Perl with a few bits in C and shell scripts for system integration.

The compendium at https://raptor.pccc.com/raptor.cgim?template=email_spam_compendium is helpful for understanding some of the concepts behind SpamAssassin.

It will be helpful for a student on this project to understand SMTP, but a willingness to learn and to set up your own mail server with SpamAssassin on a Linux distribution for a personal test domain is highly desired. Assistance will be provided to get the basic framework of a sandbox in place for learning.

As email becomes more commoditized by major providers, knowledge of email systems and their security is dwindling.  This opportunity can provide real-world experience with an email security product that is employed by countless commercial systems in the world.

  was:
From DFS idea used with permission:

We tokenize inbound messages and store the tokens on the server. In each message, we add links for doing training. When you click on a training link, the system trains the message based on the tokens stored on the server. In that way, you are training using exactly the tokens that the Bayes code saw. 

For SA, the key point is a framework that stores the Bayesian tokens from an email before the email is delivered so that, later, a "this is spam"/"this is ham" mechanism can use that information without needing the entire email.

Adding a header that carries the message id used for the storage makes a train-as-spam/train-as-ham framework easier to build.

The issues you are pointing to have more to do with the implementation of the "this is spam"/"this is ham" mechanism.

Storing just the tokens takes less space and mitigates privacy and legal concerns.

sa-learn would then be extended to take the message id and learn the stored tokens as spam or ham, instead of being fed the entire message.


> GSOC 2018 SpamAssassin Bayes Token ID
> -------------------------------------
>
>                 Key: COMDEV-260
>                 URL: https://issues.apache.org/jira/browse/COMDEV-260
>             Project: Community Development
>          Issue Type: Project
>            Reporter: Kevin A. McGrail
>            Priority: Major



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
