Posted to server-dev@james.apache.org by "Robert Burrell Donkin (JIRA)" <se...@james.apache.org> on 2011/04/06 22:56:05 UTC
[jira] [Issue Comment Edited] (JAMES-1216) [gsoc2011] Design and implement machine learning filters and categorization for mail

    [ https://issues.apache.org/jira/browse/JAMES-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13016521#comment-13016521 ] 

Robert Burrell Donkin edited comment on JAMES-1216 at 4/6/11 8:56 PM:
----------------------------------------------------------------------

Feature Selection
-------------------------
Feature extraction from emails can produce a large number of features, and hence a high-dimensional feature space.

For some algorithms, this may have undesirable performance consequences. For example, k-nearest neighbour implementations typically hold all training data in memory during classification, and compute the distance between the test point and each training point. To understand this trade-off, it would be useful to estimate how memory and computational complexity scale with the number of features, and relate this to the desired mail throughput.
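
As a rough illustration of that kind of estimate, here is a minimal sketch in Python. It assumes a brute-force k-nearest neighbour classifier over dense feature vectors; sparse representations or index structures would change the numbers. The function names and all figures (training set size, bytes per value, operations per second) are hypothetical and chosen only to show how the cost grows.

```python
def knn_cost(n_training, n_features, bytes_per_value=8):
    """Rough cost of brute-force k-NN with dense feature vectors (assumption)."""
    # Memory: every training vector is held in memory during classification.
    memory_bytes = n_training * n_features * bytes_per_value
    # Work: one distance per training point, each touching every feature.
    ops_per_message = n_training * n_features
    return memory_bytes, ops_per_message


def max_throughput(n_training, n_features, ops_per_second):
    """Upper bound on messages classified per second on a single core."""
    _, ops_per_message = knn_cost(n_training, n_features)
    return ops_per_second / ops_per_message


if __name__ == "__main__":
    # Illustrative numbers only: 50,000 training mails, 1e9 operations/second.
    for d in (1_000, 10_000, 100_000):
        mem, _ = knn_cost(n_training=50_000, n_features=d)
        rate = max_throughput(50_000, d, ops_per_second=1e9)
        print(f"{d:>7} features: {mem / 2**20:10.0f} MiB, {rate:8.2f} msg/s")
```

Under these assumptions both memory and per-message work grow linearly with the number of features, so a tenfold reduction in features gives roughly a tenfold gain in the throughput bound.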

A strong GSOC application should probably consider feature selection, so that it can be factored into the design even if time does not allow a full implementation.
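
To make the idea concrete, here is a minimal sketch of one simple filter-style approach: rank discrete (for example binary "word present" or "word absent") features by their mutual information with the class label and keep the top k. This is plain univariate ranking, not the conditional mutual information method from the Fleuret paper listed below, which also accounts for redundancy between features. The function names and the data layout (a list of rows, one list of feature values per mail) are assumptions made for illustration.

```python
import math
from collections import Counter


def mutual_information(feature_values, labels):
    """Mutual information (in bits) between one discrete feature and the label."""
    n = len(labels)
    joint = Counter(zip(feature_values, labels))
    f_counts = Counter(feature_values)
    l_counts = Counter(labels)
    mi = 0.0
    for (f, l), c in joint.items():
        # Term p(f, l) * log2(p(f, l) / (p(f) * p(l))), written with raw counts.
        mi += (c / n) * math.log2(c * n / (f_counts[f] * l_counts[l]))
    return mi


def select_top_k(rows, labels, k):
    """Indices of the k features with the highest mutual information with the label."""
    n_features = len(rows[0])
    scores = [
        (mutual_information([row[j] for row in rows], labels), j)
        for j in range(n_features)
    ]
    # Highest score first; ties broken by feature index for determinism.
    scores.sort(key=lambda s: (-s[0], s[1]))
    return [j for _, j in scores[:k]]


if __name__ == "__main__":
    # Toy data: feature 0 matches the label exactly, features 1 and 2 do not.
    rows = [[1, 0, 1], [1, 1, 0], [0, 0, 1], [0, 1, 0]]
    labels = [1, 1, 0, 0]
    print(select_top_k(rows, labels, k=1))
```

A filter like this runs once at training time, so it can be designed in as a separate step between feature extraction and classification even if only a trivial version is implemented at first.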

"An Introduction to Variable and Feature Selection" by Guyon and Elisseeff; http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.85.3593&rep=rep1&type=pdf
"Fast Binary Feature Selection with Conditional Mutual Information" by Fleuret; http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.60.8398&rep=rep1&type=pdf

      was (Author: robertburrelldonkin):
    Feature Selection
-------------------------
Feature extraction from emails may potentially result in a large number of features, and so high dimensionality. For some algorithms 
  
> [gsoc2011] Design and implement machine learning filters and categorization for mail
> ------------------------------------------------------------------------------------
>
>                 Key: JAMES-1216
>                 URL: https://issues.apache.org/jira/browse/JAMES-1216
>             Project: JAMES Server
>          Issue Type: New Feature
>            Reporter: Eric Charles
>            Assignee: Eric Charles
>              Labels: gsoc2011
>
> Context: Anti-spam functionality based on SpamAssassin is available in James (based on mailets, http://james.apache.org/mailet). Bayesian mailets are also available, but not completely integrated or documented. Nothing is available to automatically categorize mail traffic per user.
> Task: We are willing to align the existing implementation with any modern anti-spam solution based on a powerful machine learning implementation (such as Apache Mahout). We are also willing to extend the machine learning usage to mail categorization (spam vs. not-spam is a first category; we can extend it to any additional category we can imagine). The implementation can partially occur while spooling the mails and/or when mail is stored in the mailbox.
> Related discussions: See also discussions on mail intelligent mining on http://markmail.org/message/2bodrwvdvtfq3f2v (mahout related) and http://markmail.org/thread/pksl6csyvoeo27yh (hama related).
> Mentor: eric at apache dot org & [fill in mentor]
> Complexity: high 
