Posted to server-user@james.apache.org by David Legg <da...@searchevent.co.uk> on 2012/10/23 02:12:06 UTC

Bayesian Analysis for v3

Hi all,

It's been a long time since I frequented this list!

After many years of faithful service I'm upgrading my server and thought 
I'd check to see what's happening with James.  I'm pleased to see v3 is 
beginning to emerge and I'll be happy to take it for a spin.

I see nothing much has changed with the Bayesian analysis mailet. It has 
performed very well for me and I'd definitely recommend it to people.  
However, I've just taken a look at the code for the first time and I 
think I'd like to have a go at improving it, especially as IMAP is now a 
possibility.

I have a couple of ideas I'd like to try and I thought I'd air them here 
in case anyone has a brighter idea or some advice; thanks.

As it stands, the current Bayesian filter has a relatively simplistic 
tokenizer.  It literally seems to break the email into tokens with 
little regard to whether that bit of text is a MIME boundary, base64, 
image, document or header etc.  My spam and ham database is filled with 
millions of random-looking chunks of text, mainly from base64 encoded 
images!  So my first plan is to make the tokenizer more intelligent.  It 
should carefully extract far more meta-data from the email.

I'm not the first to think of this of course.  Paul Graham originally 
wrote 'A Plan for Spam' [1] back in 2002 and then updated it with 
'Better Bayesian Filtering' [2] in 2003.  This spawned several projects 
and products.  The most feature-complete version is SpamProbe [3] by 
Brian Burton, but a Java version exists in a project called jASEN [4].  
The latter project has been quiet for a few years and was also forked 
into a proprietary product.

I'm quite interested in the fact that James 3 supports IMAP.  I think 
this may make it easier and more efficient for users to maintain their 
own spam folder.  Currently users have to send any spam (or ham) they 
receive to an address such as spam@xxx.yyy (or no-spam@xxx.yyy) and if
they forget to send it as an attachment they risk poisoning the spam 
corpus.  Think how much easier it would be to simply move an email from 
one of your email folders to a special 'spam' folder.  Also think how 
much easier it would be to browse the spam folder looking for 
mis-classified emails and drag them back to the correct folder.  
Currently, I delete emails classified as spam and if someone wants it 
back I have to go rooting about in MySQL's binary logs!

I worry how big the spam folder may get if I'm not deleting spam 
messages.  I may have to automatically expire spam messages that get to 
a certain age.  Or it may be that a small amount of fastfailing reduces 
the spam intake to manageable amounts.

I'm not sure how IMAP and POP3 play together yet.  I guess a user should 
only manage their email via IMAP or POP3, but not both.  Is that right?  
However, improving the Bayesian tokenizer should improve spam filtering 
for both access methods.

Best Regards,
David Legg


[1] http://www.paulgraham.com/spam.html
[2] http://www.paulgraham.com/better.html
[3] http://spamprobe.sourceforge.net/
[4] http://jasen.sourceforge.net/


---------------------------------------------------------------------
To unsubscribe, e-mail: server-user-unsubscribe@james.apache.org
For additional commands, e-mail: server-user-help@james.apache.org


Re: Bayesian Analysis for v3

Posted by Josip Almasi <jo...@vrspace.org>.
David Legg wrote:
> That's pretty straightforward actually.  Suppose you have a sentence "Mary had a little lamb" then you would generate the following token values in addition to the single word tokens if you were capturing a phrase size of 2: -
>
>    Maryhad
>    hada
>    alittle
>    littlelamb

Neat trick, I wonder how it works out.
The token database might get too large though, especially with malformed MIME types.

> I recommend you read Paul Graham's 'Better Bayesian Filtering' [2] (especially the bit titled 'Tokens').  It's fascinating stuff... or maybe I'm getting too old and geeky :-)

Sure I did, quite a while ago.

>>> Image info needs extracting too.  So things like the width, height, bit depth, type of encoding, Exif data and any tags should all be captured.
>>
>> ...what would you use to extract image info?
>
> I haven't used any graphics libraries recently but a quick scan suggests 'Commons Sanselan' [3] which happily is an Apache project now.

Seems easy.
Broken link to MetadataExample.java though :/

Well, you got it all covered.

Regards...




Re: Bayesian Analysis for v3

Posted by David Legg <da...@searchevent.co.uk>.
>> That's very interesting.  Did you use the Mime4J library to do the 
>> heavy lifting or did you parse the whole message yourself?
>
> I used javax.mail, started from a good mail parsing example included.
> Parsed html with javax.swing.text.html.HTMLEditorKit.

Thanks for the hint.  I'll take a look.

>
>> Not so sure about ignoring numbers though.  Certainly, need to 
>> capture IP addresses, HTML and CSS colour settings and also domain 
>> names.  I can see there will be a lot of tweaking involved.
>
> The catch with numbers is, I received some CSV files, containing 
> database table dumps, hundreds of thousands of lines, each containing 
> unique codes.

I understand where you are coming from now.

Ok, so the problem as I see it at the moment is that James isn't feeding 
the Bayes algorithm with quality tokens upon which it can work its 
statistical magic effectively.

I have to break any email down into its constituent parts (e.g. headers, 
body, attachments) and then intelligently extract whatever useful 
metadata (or in the case of the email body its actual data) I can get.  
So when I talk about capturing 'numbers' I'm talking in the context of 
one of these constituent email parts and not necessarily the email as a 
whole.  I can see that it might even be beneficial in the future to have 
plugins that specialize in extracting tokens from particular mime 
types... but not just yet!

When I say a 'token' I'm thinking about an object which not only 
represents a string and how often that string has been seen in a ham or 
spam corpus but also a context and a timestamp.  The context records 
which part of the email we are talking about and the timestamp records 
the date and time of the last recorded occurrence of this token in an email.

I think the context is important because it lets the Bayes algorithm 
learn for example that 'Free!' seen in the Subject: header is more 
spammy than the same string seen in the body of an email.

The timestamp will enable the otherwise ever-increasing spam and ham 
corpus to be kept in check by deleting those tokens whose counts haven't 
risen above, say, 2 in 6 months.  I got this idea from SpamProbe [1].
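As a rough sketch (the class and field names here are my own invention, 
not anything that exists in James), such a token object might look 
like: -

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical token object: a string value plus its context,
// per-corpus counts and the time it was last seen in an email.
public class Token {

    final String value;    // e.g. "Free!"
    final String context;  // e.g. "SUBJECT", "BODY", "HEADER-RECEIVED-IP"
    int hamCount;
    int spamCount;
    Instant lastSeen;

    Token(String value, String context) {
        this.value = value;
        this.context = context;
    }

    // Update the counts when the token is seen in a classified email.
    void record(boolean isSpam, Instant when) {
        if (isSpam) spamCount++; else hamCount++;
        lastSeen = when;
    }

    // Expire rarely seen tokens: total count no higher than 2 and not
    // seen for roughly 6 months (the SpamProbe-style pruning above).
    boolean isExpired(Instant now) {
        return hamCount + spamCount <= 2
            && lastSeen != null
            && lastSeen.isBefore(now.minus(Duration.ofDays(180)));
    }
}
```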
>
> IP and domain names, I don't think so.
> Suppose you use the dot as a delimiter. Then each byte of an IP address 
> becomes a token and gets its own weight. Much the same with domains.
> Bayes should take care of the rest.

So following on from what I said above I'm talking about IP addresses 
and domain names as seen in the context of headers. Here's an example 
from a recent spam: -

   Received: from tedwoodsports.dh.bytemark.co.uk (HELO User) (89.16.177.117)
     by banddtruckparts.com with ESMTPA; 16 Oct 2012 21:01:11 -0400


In this example I would extract a token with context: HEADER-RECEIVED-IP 
and value: '89.16.177.117'.  In fact knowing how IP addresses are 
constructed I could also record a similar token with value: 89.16.177 
because it may be statistically significant that any of the 256 
addresses that fall into this range are either spam or maybe even ham.  
I don't know that but at least I'm giving the Bayes algorithm extra info 
that it may find statistically significant.  If it isn't then that token 
will be deleted after a while anyway.

Similarly I would also create tokens with the context: 
HEADER-RECEIVED-FROM-DOMAIN and the following values: -

   tedwoodsports.dh.bytemark.co.uk
   dh.bytemark.co.uk
   bytemark.co.uk
   co.uk
   uk
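A small sketch of how those tokens could be generated (the helper names 
are mine; nothing like this exists in James yet): -

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helpers for turning the IP address and HELO domain from
// a Received: header into tokens: the full IP plus its /24 prefix, and
// every trailing suffix of the domain.
public class HeaderTokens {

    // "89.16.177.117" -> ["89.16.177.117", "89.16.177"]
    public static List<String> ipTokens(String ip) {
        List<String> tokens = new ArrayList<String>();
        tokens.add(ip);
        int lastDot = ip.lastIndexOf('.');
        if (lastDot > 0) {
            tokens.add(ip.substring(0, lastDot)); // the surrounding /24 range
        }
        return tokens;
    }

    // "dh.bytemark.co.uk" -> each trailing suffix down to "uk"
    public static List<String> domainTokens(String domain) {
        List<String> tokens = new ArrayList<String>();
        String rest = domain;
        while (true) {
            tokens.add(rest);
            int dot = rest.indexOf('.');
            if (dot < 0) break;
            rest = rest.substring(dot + 1);
        }
        return tokens;
    }
}
```

Each value would then be stored against its context (HEADER-RECEIVED-IP 
or HEADER-RECEIVED-FROM-DOMAIN) rather than as a bare string.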

>> I'm keen to capture phrases (ie. capturing two or more sequential 
>> words) as I've heard they improve detection at the expense of a 
>> larger token database.
>
> Any pointers?
>

That's pretty straightforward actually.  Suppose you have a sentence 
"Mary had a little lamb" then you would generate the following token 
values in addition to the single word tokens if you were capturing a 
phrase size of 2: -

   Maryhad
   hada
   alittle
   littlelamb
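In code the pairing is just a sliding window over the word list; a 
minimal sketch (the class name is made up): -

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical phrase tokenizer: emits concatenated word pairs
// (phrase size 2) in addition to the single-word tokens.
public class PhraseTokenizer {

    public static List<String> pairTokens(String text) {
        String[] words = text.trim().split("\\s+");
        List<String> tokens = new ArrayList<String>();
        for (int i = 0; i + 1 < words.length; i++) {
            tokens.add(words[i] + words[i + 1]); // e.g. "Mary" + "had"
        }
        return tokens;
    }
}
```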

I recommend you read Paul Graham's 'Better Bayesian Filtering' [2] 
(especially the bit titled 'Tokens').  It's fascinating stuff... or 
maybe I'm getting too old and geeky :-)

>> Image info needs extracting too.  So things like the width, height, 
>> bit depth, type of encoding, Exif data and any tags should all be 
>> captured.
>
> ...what would you use to extract image info?

I haven't used any graphics libraries recently but a quick scan suggests 
'Commons Sanselan' [3] which happily is an Apache project now.  When it 
comes to extracting metadata from MS documents I think Apache POI [4] 
is still a good choice.

David.

[1] http://spamprobe.sourceforge.net/
[2] http://www.paulgraham.com/better.html
[3] http://commons.apache.org/imaging/
[4] http://poi.apache.org/




Re: Bayesian Analysis for v3

Posted by Josip Almasi <jo...@vrspace.org>.
David Legg wrote:
> Hi Josip,
>
> Thanks for your comments.
>
> On 24/10/12 15:42, Josip Almasi wrote:
>>
>> I think I'll wait till it works with Java 7. (workaround didn't work for me)
>
> I didn't know that.  I'm OK with Java 6 for the moment as that is the default with Ubuntu 12.04.  Still not quite comfortable with this IcedTea business though... I prefer 100% Java beans :-)

Well, the new JAXB broke more applications. Right now I can't remember exactly which ones, but I had to go back to JDK 6.

>>> So my first plan is to make the tokenizer more intelligent.  It should carefully extract far more meta-data from the email.
>>
>> Wrote some mail parsing code, parses plain text and HTML, ignores other MIME types. For others, I guess only headers should be taken into account.
>> Malformed MIMEs are a real issue there. So I used heuristics to avoid them - number of tokens and size of tokens.
>> Also, better to ignore numbers, or use them as delimiters.
>> Of course, all message parts need to be processed. That's not cheap, and should be limited by max allowed time and/or number of tokens.
>
> That's very interesting.  Did you use the Mime4J library to do the heavy lifting or did you parse the whole message yourself?

I used javax.mail, started from a good mail parsing example included.
Parsed html with javax.swing.text.html.HTMLEditorKit.

It's for my mail archiver, not (yet) having anything to do with JAMES:
http://sf.net/projects/mar

So I did sort of the opposite of what antispam is intended to do: I captured only 'good' keywords.

> That's a good point about malformed MIMEs.  Even with the relatively small number of spams I've collected I noticed a number of deviant practices.

Tell me about it, one even managed to produce StackOverflowError in html parser:>

> Not so sure about ignoring numbers though.  Certainly, need to capture IP addresses, HTML and CSS colour settings and also domain names.  I can see there will be a lot of tweaking involved.

Ah CSS, I forgot about it completely. True, it has to be analyzed.
Uh, HTML... right, for antispam purposes, tags need to be saved too.

The catch with numbers is, I received some CSV files, containing database table dumps, hundreds of thousands of lines, each containing unique codes.
And of course, many many smaller ones, with various server logs etc.
Best being left alone.

IP and domain names, I don't think so.
Suppose you use the dot as a delimiter. Then each byte of an IP address becomes a token and gets its own weight. Much the same with domains.
Bayes should take care of the rest.
IP addresses are relatively rare in mails, domains being much more important.
Now, should we tokenize www.spammer.com, then weight www, spammer, and com, or should we store domain as it is?
I think - tokenize.
It's just a bit more processing, but possibly much less storage:
- one "www" and "com" instead of zillion stored
- two dots less
- "spammer" is just another keyword stored, weighted, possible to occur in other mails containing no domain www.spammer.com
(this is all about message content of course, headers should not be tokenized)

Anyway, here's my delimiter list:
" ,./<>?`~!@#$%^&*()_+=-{}|[]\\;':\"\r\n\t1234567890"
Though numbers should probably be excluded:)
Watching parsing time and keyword number should eliminate problems with numbers.
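That delimiter list plugs straight into java.util.StringTokenizer; a 
minimal sketch (the class name is made up): -

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical tokenizer using the delimiter list above, so
// punctuation and digits never become tokens themselves.
public class DelimiterTokenizer {

    static final String DELIMS =
            " ,./<>?`~!@#$%^&*()_+=-{}|[]\\;':\"\r\n\t1234567890";

    public static List<String> tokens(String text) {
        List<String> result = new ArrayList<String>();
        StringTokenizer st = new StringTokenizer(text, DELIMS);
        while (st.hasMoreTokens()) {
            result.add(st.nextToken());
        }
        return result;
    }
}
```

e.g. tokens("Order 500 pills at www.spammer.com!") gives [Order, pills, 
at, www, spammer, com] - the number and the dots disappear.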

> I'm keen to capture phrases (ie. capturing two or more sequential words) as I've heard they improve detection at the expense of a larger token database.

Any pointers?

I don't know... quite complicated.

Though some lexical comparison might make sense. Here I wrote some examples, but that got a 7.1 spam score and was returned to me :)

> Image info needs extracting too.  So things like the width, height, bit depth, type of encoding, Exif data and any tags should all be captured.
> I quite often get large (several megabyte) emails from China containing pictures of products for me and the
> current James setup gives up with messages of that size.  Or rather it creates thousands of random tokens full of base64 segments!

That's interesting, I don't get these. At least not as single part messages. So Bayes probably picked up other keywords of the text/html part, and headers.
So I think that's too much effort for small gain.

Anyway, what would you use to extract image info?

Regards...




Re: Bayesian Analysis for v3

Posted by David Legg <da...@searchevent.co.uk>.
Hi Josip,

Thanks for your comments.

On 24/10/12 15:42, Josip Almasi wrote:
>
> I think I'll wait till it works with Java 7. (workaround didn't work 
> for me)

I didn't know that.  I'm OK with Java 6 for the moment as that is the 
default with Ubuntu 12.04.  Still not quite comfortable with this 
IcedTea business though... I prefer 100% Java beans :-)

>> So my first plan is to make the tokenizer more intelligent.  It 
>> should carefully extract far more meta-data from the email.
>
> Wrote some mail parsing code, parses plain text and html, ignores 
> other MIME types. For others, I guess only headers should be taken 
> into account.
> Malformed MIMEs are a real issue there. So I used heuristics to avoid 
> them - number of tokens and size of tokens.
> Also, better to ignore numbers, or use them as delimiters.
> Of course, all message parts need to be processed. That's not cheap, 
> and should be limited by max allowed time and/or number of tokens.

That's very interesting.  Did you use the Mime4J library to do the heavy 
lifting or did you parse the whole message yourself?

That's a good point about malformed MIMEs.  Even with the relatively 
small number of spams I've collected I noticed a number of deviant 
practices.

Not so sure about ignoring numbers though.  Certainly, we need to 
capture IP addresses, HTML and CSS colour settings and also domain 
names.  I can see there will be a lot of tweaking involved.

I'm keen to capture phrases (i.e. capturing two or more sequential 
words) as I've heard they improve detection at the expense of a larger 
token database.

Image info needs extracting too.  So things like the width, height, bit 
depth, type of encoding, Exif data and any tags should all be captured.  
I quite often get large (several megabyte) emails from China containing 
pictures of products for me and the current James setup gives up with 
messages of that size.  Or rather it creates thousands of random tokens 
full of base64 segments!

>
>> I worry how big the spam folder may get if I'm not deleting spam 
>> messages.
>
> Well, I'm not deleting any spam:) You never know when you may need some;)
> Right now I have 143286 unread in my junk folder, total is 250k+, all 
> correctly marked as 100% spam, 850MB.
>

I'm envious... erm.... I think!  No seriously, that's got to be useful 
to you someday.  Maybe I should start collecting them instead of 
deleting them too.  I wonder how many of those are addressed to 
'johnsmithsvt' :-)

Happy tokenizing!
David.



Re: Bayesian Analysis for v3

Posted by Josip Almasi <jo...@vrspace.org>.
Hi,

David Legg wrote:
> Hi all,
>
> It's been a long time since I frequented this list!
>
> After many years of faithful service I'm upgrading my server and thought I'd check to see what's happening with James.  I'm pleased to see v3 is beginning to emerge and I'll be happy to take it for a spin.

Same here. Though I think I'll wait till it works with Java 7. (workaround didn't work for me)

> I see nothing much has changed with the Bayesian analysis mailet. It has performed very well for me and I'd definitely recommend it to people. However, I've just taken a look at the code for the first time and I think I'd like to have a go at improving it,
> especially as IMAP is now a possibility.
>
> I have a couple of ideas I'd like to try and I thought I'd air them here in case anyone has a brighter idea or some advice; thanks.
>
> As it stands, the current Bayesian filter has a relatively simplistic tokenizer.  It literally seems to break the email into tokens with little regard to whether that bit of text is a MIME boundary, base64, image, document or header etc.  My spam and ham
> database is filled with millions of random-looking chunks of text, mainly from base64 encoded images!  So my first plan is to make the tokenizer more intelligent.  It should carefully extract far more meta-data from the email.

I might help you with that.
Wrote some mail parsing code, parses plain text and HTML, ignores other MIME types. For others, I guess only headers should be taken into account.
Malformed MIMEs are a real issue there. So I used heuristics to avoid them - number of tokens and size of tokens.
Also, better to ignore numbers, or use them as delimiters.
Of course, all message parts need to be processed. That's not cheap, and should be limited by max allowed time and/or number of tokens.

> I'm not the first to think of this of course.  Paul Graham originally wrote 'A Plan for Spam' [1] back in 2002 and then updated it with 'Better Bayesian Filtering' [2] in 2003.  This spawned several projects and products.  The most feature-complete version
> is SpamProbe [3] by Brian Burton, but a Java version exists in a project called jASEN [4]. The latter project has been quiet for a few years and was also forked into a proprietary product.
>
> I'm quite interested in the fact that James 3 supports IMAP.  I think this may make it easier and more efficient for users to maintain their own spam folder.  Currently users have to send any spam (or ham) they receive to an address such as spam@xxx.yyy
> (or no-spam@xxx.yyy) and if they forget to send it as an attachment they risk poisoning the spam corpus.  Think how much easier it would be to simply move an email from one of your email folders to a special 'spam' folder.  Also think how much easier it
> would be to browse the spam folder looking for mis-classified emails and drag them back to the correct folder. Currently, I delete emails classified as spam and if someone wants it back I have to go rooting about in MySQL's binary logs!

Right!

> I worry how big the spam folder may get if I'm not deleting spam messages.  I may have to automatically expire spam messages that get to a certain age.  Or it may be that a small amount of fastfailing reduces the spam intake to manageable amounts.

Well, I'm not deleting any spam:) You never know when you may need some;)
Right now I have 143286 unread in my junk folder, total is 250k+, all correctly marked as 100% spam, 850MB.

Regards...

