Posted to server-user@james.apache.org by Chris Means <cm...@intfar.com> on 2002/08/16 20:48:55 UTC

Anti-SPAM mailet

Would anyone be interested in developing a mailet (or whatever) to
implement some of the anti-spam techniques described in this article
mentioned on /.?

http://www.paulgraham.com/spam.html

I'd rather it put a flag in the email so I could filter it in my email
client, but it would be nice to have the option of automatically forwarding
it to SpamCop etc. for reporting purposes.

Any thoughts?

Thanks.

-Chris

RE: Anti-SPAM mailet

Posted by "Noel J. Bergman" <no...@devtech.com>.
Chris and Serge,

You might want to look at this:
http://research.microsoft.com/~horvitz/junkfilter.htm

Turns out this was a research project between some folks at Microsoft and
Stanford about four years ago.  Good data in the paper.

	--- Noel



--
To unsubscribe, e-mail:   <ma...@jakarta.apache.org>
For additional commands, e-mail: <ma...@jakarta.apache.org>


RE: Anti-SPAM mailet

Posted by Chris Means <cm...@intfar.com>.
Hi Serge,

I've written a mailet that will work to build the word stats from messages
that are mailed to a particular email address on the server.

I'll be posting the code to the dev list for comments etc. later tonight.

My first pass uses a JDBC backend (a table of words/occurrences), which is
loaded at James start, then saved when James is shut down...just a first
pass...as I'd want the 'stats' updated more frequently.
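
A minimal sketch of the kind of in-memory word table described above, using
plain JDK classes only. The class and method names (WordStats, train,
goodCount, spamCount) are hypothetical and are not part of James or the
Mailet API, and the JDBC load/save step is left out. The token rule
(letters, digits, dashes, apostrophes and dollar signs, skipping all-digit
tokens) is taken from the linked article.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: counts how often each token occurs in good and spam samples.
class WordStats {
    // Token characters per the linked article: alphanumerics, dashes, apostrophes, dollar signs.
    private static final Pattern TOKEN = Pattern.compile("[A-Za-z0-9$'-]+");

    // token -> {occurrences in good mail, occurrences in spam}
    private final Map<String, int[]> counts = new HashMap<>();
    private int goodMessageCount;
    private int spamMessageCount;

    // Record one sample message; 'spam' says which corpus it belongs to.
    void train(String text, boolean spam) {
        if (spam) {
            spamMessageCount++;
        } else {
            goodMessageCount++;
        }
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            String token = m.group().toLowerCase();
            // Tokens made up only of digits are ignored.
            if (token.chars().allMatch(Character::isDigit)) {
                continue;
            }
            counts.computeIfAbsent(token, k -> new int[2])[spam ? 1 : 0]++;
        }
    }

    int goodCount(String token) { return lookup(token)[0]; }
    int spamCount(String token) { return lookup(token)[1]; }
    int goodMessageCount() { return goodMessageCount; }
    int spamMessageCount() { return spamMessageCount; }

    // Unknown tokens report zero for both corpora.
    private int[] lookup(String token) {
        return counts.getOrDefault(token.toLowerCase(), new int[2]);
    }

    public static void main(String[] args) {
        WordStats stats = new WordStats();
        stats.train("Buy cheap pills now, only $100", true);
        stats.train("Meeting notes: who will buy lunch?", false);
        System.out.println("buy: good=" + stats.goodCount("buy")
                + " spam=" + stats.spamCount("buy"));
    }
}
```

In a mailet, train(...) would be called with spam=true for mail sent to the
SPAM sample address and spam=false for mail sent to the "good" sample
address; how the message text is extracted is not shown here.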

IMAP just isn't there with James yet, and I wanted to make something that
would be relatively flexible (it's easy to just forward a message to a
'specific' account...one for SPAM samples, another for "good" samples).

Maybe we can put our parts together to make a whole...

-Chris



Re: Anti-SPAM mailet

Posted by Serge Knystautas <se...@lokitech.com>.
Chris,

I came across this link as well... I'm convinced this is a far more
effective spam blocker than any blacklist/checksum/group spam blocker.
Looks very very promising.

I went ahead and put together a bunch of the code for this... I thought
about how you would best want to build the corpus and for my money, I
decided I would create the corpus based on IMAP folders.  I'm working on an
ant task that could on a daily or weekly basis trawl a set of IMAP folders
to build the good and bad corpus.

Anyway, I wrote code to tokenize MimeMessages, the code that compares
the good and bad corpus and builds the probability token set, the Bayesian
calculator to combine the probabilities of the 15 most interesting words,
and some other related utilities.  It's still a ways from being anything
useful, and it would be really great once James has solid IMAP support.
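
The combining step can be sketched as follows. This is a minimal sketch, not
the code mentioned above, and it assumes the formulas from the linked
article: good counts are doubled, tokens seen fewer than five times are
treated as unknown (0.4), probabilities are clamped to [0.01, 0.99], and the
15 tokens farthest from 0.5 are combined. The class and method names are
hypothetical, and nGood and nBad (the number of messages in each corpus) are
assumed to be greater than zero.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of the per-token and combined spam probabilities.
class BayesSketch {

    // Probability that a message containing this token is spam.
    // goodCount/badCount: occurrences of the token in each corpus.
    // nGood/nBad: number of messages in each corpus (assumed > 0).
    static double tokenProbability(int goodCount, int badCount, int nGood, int nBad) {
        double g = 2.0 * goodCount; // good counts doubled to bias against false positives
        double b = badCount;
        if (g + b < 5) {
            return 0.4; // too rare to trust; same value used for unseen tokens
        }
        double goodFreq = Math.min(1.0, g / nGood);
        double badFreq = Math.min(1.0, b / nBad);
        double p = badFreq / (goodFreq + badFreq);
        return Math.max(0.01, Math.min(0.99, p));
    }

    // Combine the 'limit' most interesting probabilities (farthest from 0.5).
    static double combine(List<Double> probabilities, int limit) {
        List<Double> sorted = new ArrayList<>(probabilities);
        sorted.sort(Comparator.comparingDouble((Double p) -> Math.abs(p - 0.5)).reversed());
        double product = 1.0;
        double inverseProduct = 1.0;
        for (int i = 0; i < Math.min(limit, sorted.size()); i++) {
            double p = sorted.get(i);
            product *= p;
            inverseProduct *= 1.0 - p;
        }
        return product / (product + inverseProduct);
    }

    public static void main(String[] args) {
        List<Double> probs = Arrays.asList(
                tokenProbability(0, 10, 100, 100),  // token seen only in spam
                tokenProbability(20, 1, 100, 100),  // token seen mostly in good mail
                0.4);                               // unseen token
        System.out.println("combined = " + combine(probs, 15));
    }
}
```

The article treats a message as spam when the combined value exceeds 0.9; a
matcher could compare the result of combine(...) against such a threshold.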

The hard part about this approach though is you need a decent sized corpus
to make it really usable.  I think it's pretty clear you could have a
matcher use the probability set to either mark the message as spam or not...
but again building that corpus is the hardest part.

Serge Knystautas
Loki Technologies
http://www.lokitech.com/


RE: Anti-SPAM mailet

Posted by "Noel J. Bergman" <no...@devtech.com>.
Chris,

> The advantage of SpamCop is that I'll still have the option to not report
> an individual message as SPAM if it appears to be legit.

What you can do is forward spam to the FTC.  They actually WANT it.  I
posted a sample <mailet> tag a while back that should be in the archives.

	--- Noel




RE: Anti-SPAM mailet

Posted by Chris Means <cm...@intfar.com>.
Understood.

The advantage of SpamCop is that I'll still have the option to not report an
individual message as SPAM if it appears to be legit.



RE: Anti-SPAM mailet

Posted by Danny Angus <da...@apache.org>.
Be aware that many blacklists dislike automated submissions.



RE: Anti-SPAM mailet

Posted by "Noel J. Bergman" <no...@devtech.com>.
> > (4) You might want to look at
> > http://sourceforge.net/projects/pop3filterproxy/
> Wouldn't this be outside of James though?

I meant for some additional ideas on filtering spam.  :-)

> > (5) If we have a generic RegexMatcher ...
> This seems like a worthy project to get the experience with...

And not too difficult, either.  :-)

> > (6) We'll want a mailet capable of tagging a message.
> Are there any examples that have this functionality?

Yes and no.  There are examples of adding headers, but nothing generic.
Perhaps because there isn't a means (yet) to tag meta data on a mail object,
which could be used to communicate between different James components.  That
could be used to convey information such as a non-binary spam rating or a
spam reason.  But for a simple binary rating, one could do something like:

  <mailet matcher="SpamScan=..." class="AddHeader">
    <header>X-SPAM: true</header>
  </mailet>

Or we could drop the notion of a separate matcher (for now), and combine the
two:

  <mailet matcher="All" class="SpamScanner">
     <param>...</param>
  </mailet>

Anyhow, that's just an idea.  See you on the developer's list.

	--- Noel




RE: Anti-SPAM mailet

Posted by Chris Means <cm...@intfar.com>.
> (1) You want a matcher

OK.  Right...I knew I'd probably get the terminology wrong <g>

> (2) James has a very flexible configuration approach to allow you
>     to do whatever you want with messages, using suitable matchers
>     and mailets.

Agreed.

> (3) There are several people interested in anti-SPAM matchers

Yep.

> (4) You might want to look at
> http://sourceforge.net/projects/pop3filterproxy/

Wouldn't this be outside of James though?

> (5) If we have a generic RegexMatcher (and some suitable
> subclasses), we can
>     detect all sorts of content.  I've posted some notes on this to the
>     mailing list in the past, and I have some code in raw state on my own
>     system.

I've not even attempted to write an extension of any sort for James yet.
This seems like a worthy project to get the experience with...

> (6) We'll want a mailet capable of tagging a message.

Are there any examples that have this functionality?

> If you are interested in working on James, you might want to join the
> James Developer mailing list, and participate.  All hands are
> welcome.  :-)

Will do...

Thanks.

-Chris




RE: Anti-SPAM mailet

Posted by "Noel J. Bergman" <no...@devtech.com>.
(1) You want a matcher
(2) James has a very flexible configuration approach to allow you
    to do whatever you want with messages, using suitable matchers
    and mailets.
(3) There are several people interested in anti-SPAM matchers
(4) You might want to look at
http://sourceforge.net/projects/pop3filterproxy/
(5) If we have a generic RegexMatcher (and some suitable subclasses), we can
    detect all sorts of content.  I've posted some notes on this to the
    mailing list in the past, and I have some code in raw state on my own
    system.
(6) We'll want a mailet capable of tagging a message.
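
Points (5) and (6) can be sketched together in plain Java. This is only an
illustration, assuming java.util.regex and a plain header map standing in
for a real message; RegexTagSketch and its methods are hypothetical and do
not use the James Matcher or Mailet interfaces. The X-SPAM header name is
the one used in the <mailet> example elsewhere in this thread.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical sketch: regex-based content detection plus header tagging.
class RegexTagSketch {
    private final List<Pattern> patterns = new ArrayList<>();

    // Compile each regular expression once, ignoring case.
    RegexTagSketch(List<String> regexes) {
        for (String regex : regexes) {
            patterns.add(Pattern.compile(regex, Pattern.CASE_INSENSITIVE));
        }
    }

    // True if any configured pattern occurs anywhere in the content.
    boolean matches(String content) {
        for (Pattern p : patterns) {
            if (p.matcher(content).find()) {
                return true;
            }
        }
        return false;
    }

    // Tag the message (represented here only by its headers) when the content matches.
    void tag(Map<String, String> headers, String content) {
        if (matches(content)) {
            headers.put("X-SPAM", "true");
        }
    }

    public static void main(String[] args) {
        List<String> regexes = new ArrayList<>();
        regexes.add("free\\s+money");
        RegexTagSketch sketch = new RegexTagSketch(regexes);
        Map<String, String> headers = new LinkedHashMap<>();
        sketch.tag(headers, "Get FREE money now");
        System.out.println(headers);
    }
}
```

Keeping matches(...) separate from tag(...) mirrors the matcher/mailet
split: the first decides whether a message is selected, the second acts on
it.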

If you are interested in working on James, you might want to join the
James Developer mailing list, and participate.  All hands are welcome.  :-)

	--- Noel
