Posted to server-user@james.apache.org by David Legg <da...@searchevent.co.uk> on 2012/10/23 02:12:06 UTC
Bayesian Analysis for v3
Hi all,
It's been a long time since I frequented this list!
After many years of faithful service I'm upgrading my server and thought
I'd check to see what's happening with James. I'm pleased to see v3 is
beginning to emerge and I'll be happy to take it for a spin.
I see nothing much has changed with the Bayesian analysis mailet. It has
performed very well for me and I'd definitely recommend it to people.
However, I've just taken a look at the code for the first time and I
think I'd like to have a go at improving it, especially as IMAP is now a
possibility.
I have a couple of ideas I'd like to try and I thought I'd air them here
in case anyone has a brighter idea or some advice; thanks.
As it stands, the current Bayesian filter has a relatively simplistic
tokenizer. It literally seems to break the email into tokens with
little regard to whether that bit of text is a MIME boundary, base64,
an image, a document, a header, etc. My spam and ham database is filled
with millions of random-looking chunks of text, mainly from base64-encoded
images! So my first plan is to make the tokenizer more intelligent. It
should carefully extract far more meta-data from the email.
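To make that concrete, here is one crude heuristic a smarter tokenizer could apply: refuse to emit long, case-mixed runs drawn from the base64 alphabet, so they never reach the token database. This is only an illustrative sketch; the class name and thresholds are mine and nothing like this exists in James today.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: drop base64-looking "words" before they pollute the corpus.
class Base64AwareTokenizer {
    // A base64 chunk tends to be long, case-mixed, and drawn from [A-Za-z0-9+/=].
    static boolean looksLikeBase64(String word) {
        return word.length() > 20
                && word.matches("[A-Za-z0-9+/=]+")
                && !word.equals(word.toLowerCase())
                && !word.equals(word.toUpperCase());
    }

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String word : text.split("\\s+")) {
            if (!word.isEmpty() && !looksLikeBase64(word)) {
                tokens.add(word);
            }
        }
        return tokens;
    }
}
```

Real base64 bodies are of course better handled by decoding the MIME part properly; this is only a last-ditch filter for malformed messages.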
I'm not the first to think of this of course. Paul Graham originally
wrote 'A Plan for Spam' [1] back in 2002 and then updated it with
'Better Bayesian Filtering' [2] in 2003. This spawned several projects
and products. The more feature complete version is SpamProbe [3] by
Brian Burton but a Java version exists with a project called jASEN [4].
This latter project has been quiet for a few years and was forked into a
proprietary product as well.
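For anyone who hasn't read the articles recently, the per-token scoring at the heart of all these tools is Graham's formula from 'A Plan for Spam' [1]: a token's spam probability is its normalised spam frequency divided by the sum of its spam and ham frequencies, with the ham count doubled to bias against false positives and the result clamped away from 0 and 1. A sketch in Java (the names are mine, not the James mailet's API):

```java
// Graham-style per-token spam probability, per 'A Plan for Spam'.
// Illustrative only - not the API of the James Bayesian mailet.
class TokenProbability {
    /**
     * @param bad   occurrences of the token in the spam corpus
     * @param good  occurrences of the token in the ham corpus
     * @param nbad  number of spam messages in the corpus
     * @param ngood number of ham messages in the corpus
     */
    static double spamProbability(int bad, int good, int nbad, int ngood) {
        double b = Math.min(1.0, (double) bad / nbad);
        // Graham doubles the ham count to bias the filter against false positives.
        double g = Math.min(1.0, 2.0 * good / ngood);
        double p = b / (b + g);
        // Clamp to Graham's [0.01, 0.99] so no single token is ever conclusive.
        return Math.max(0.01, Math.min(0.99, p));
    }
}
```

A token seen in 40 of 100 spams but only 1 of 100 hams scores about 0.95, while a token seen only in ham bottoms out at 0.01.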
I'm quite interested in the fact that James 3 supports IMAP. I think
this may make it easier and more efficient for users to maintain their
own spam folder. Currently users have to send any spam (or ham) they
receive to an address such as spam@xxx.yyy (or no-spam@xxx.yyy) and if
they forget to send it as an attachment they risk poisoning the spam
corpus. Think how much easier it would be to simply move an email from
one of your email folders to a special 'spam' folder. Also think how
much easier it would be to browse the spam folder looking for
mis-classified emails and drag them back to the correct folder.
Currently, I delete emails classified as spam and if someone wants it
back I have to go rooting about in MySQL's binary logs!
I worry how big the spam folder may get if I'm not deleting spam
messages. I may have to automatically expire spam messages that reach
a certain age. Or it may be that a small amount of fast-failing reduces
the spam intake to manageable amounts.
I'm not sure how IMAP and POP3 play together yet. I guess a user should
only manage their email via IMAP OR POP3 but not both. Is that right?
However, improving the Bayesian tokenizer should improve spam filtering
for both access methods.
Best Regards,
David Legg
[1] http://www.paulgraham.com/spam.html
[2] http://www.paulgraham.com/better.html
[3] http://spamprobe.sourceforge.net/
[4] http://jasen.sourceforge.net/
---------------------------------------------------------------------
To unsubscribe, e-mail: server-user-unsubscribe@james.apache.org
For additional commands, e-mail: server-user-help@james.apache.org
Re: Bayesian Analysis for v3
Posted by Josip Almasi <jo...@vrspace.org>.
David Legg wrote:
> That's pretty straightforward actually. Suppose you have a sentence "Mary had a little lamb" then you would generate the following token values in addition to the single word tokens if you were capturing a phrase size of 2: -
>
> Maryhad
> hada
> alittle
> littlelamb
Neat trick, I wonder how it works out.
The token database might get too large, though, especially with malformed MIME types.
> I recommend you read Paul Graham's 'Better Bayesian Filtering' [2] (especially the bit titled 'Tokens'). It's fascinating stuff... or maybe I'm getting too old and geeky :-)
Sure I did, quite a while ago.
>>> Image info needs extracting too. So things like the width, height, bit depth, type of encoding, Exif data and any tags should all be captured.
>>
>> ...what would you use to extract image info?
>
> I haven't used any graphics libraries recently but a quick scan suggests 'Commons Sanselan' [3] which happily is an Apache project now.
Seems easy.
Broken link to MetadataExample.java though :/
Well, you've got it all covered.
Regards...
Re: Bayesian Analysis for v3
Posted by David Legg <da...@searchevent.co.uk>.
>> That's very interesting. Did you use the Mime4J library to do the
>> heavy lifting or did you parse all the message yourself?
>
> I used javax.mail, starting from a good mail-parsing example that's included.
> I parsed the HTML with javax.swing.text.html.HTMLEditorKit.
Thanks for the hint. I'll take a look.
>
>> Not so sure about ignoring numbers though. Certainly, need to
>> capture IP addresses, HTML and CSS colour settings and also domain
>> names. I can see there will be a lot of tweaking involved.
>
> The catch with numbers is, I received some CSV files containing
> database table dumps, hundreds of thousands of lines, each containing
> unique codes.
I understand where you are coming from now.
Ok, so the problem as I see it at the moment is that James isn't feeding
the Bayes algorithm with quality tokens upon which it can work its
statistical magic effectively.
I have to break any email down into its constituent parts (e.g. headers,
body, attachments) and then intelligently extract whatever useful
metadata (or, in the case of the email body, its actual data) I can get.
So when I talk about capturing 'numbers' I'm talking in the context of
one of these constituent email parts and not necessarily the email as a
whole. I can see that it might even be beneficial in the future to have
plugins that specialize in extracting tokens from particular mime
types... but not just yet!
When I say a 'token' I'm thinking about an object which not only
represents a string and how often that string has been seen in a ham or
spam corpus but also a context and a timestamp. The context records
which part of the email we are talking about and the timestamp records
the date and time of the last recorded occurrence of this token in an email.
I think the context is important because it lets the Bayes algorithm
learn for example that 'Free!' seen in the Subject: header is more
spammy than the same string seen in the body of an email.
The timestamp will enable the otherwise ever increasing spam and ham
corpus to be kept in check by deleting those tokens whose counts haven't
risen above say 2 in 6 months. I got this idea from Spamprobe [1].
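A sketch of that token object, with the context and timestamp fields described above (all names are mine; none of this is existing James code):

```java
import java.time.Duration;
import java.time.Instant;

// Sketch of the token record described above: a string plus its context
// (which part of the email it came from), its corpus counts, and the time
// it was last seen. Illustrative names, not existing James classes.
class CorpusToken {
    final String context;   // e.g. "HEADER-SUBJECT", "BODY"
    final String value;     // e.g. "Free!"
    int hamCount;
    int spamCount;
    Instant lastSeen;

    CorpusToken(String context, String value) {
        this.context = context;
        this.value = value;
        this.lastSeen = Instant.now();
    }

    void recordSpam() { spamCount++; lastSeen = Instant.now(); }
    void recordHam()  { hamCount++;  lastSeen = Instant.now(); }

    // The SpamProbe-style expiry rule: a rarely-seen token that has gone
    // stale can be deleted to keep the corpus in check.
    boolean isExpired(Instant now, Duration maxAge, int minCount) {
        return (hamCount + spamCount) <= minCount
                && Duration.between(lastSeen, now).compareTo(maxAge) > 0;
    }
}
```

The isExpired() check implements the pruning rule above: a token is only deleted when its total count is still at or below the threshold and it hasn't been seen within the age window.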
>
> IP and domain names, I don't think so.
> Suppose you use the dot as a delimiter. Then each byte of the IP address
> becomes a token and gets its own weight. Much the same with domains.
> Bayes should take care of the rest.
So following on from what I said above I'm talking about IP addresses
and domain names as seen in the context of headers. Here's an example
from a recent spam: -
Received: from tedwoodsports.dh.bytemark.co.uk (HELO User) (89.16.177.117)
by banddtruckparts.com with ESMTPA; 16 Oct 2012 21:01:11 -0400
In this example I would extract a token with context: HEADER-RECEIVED-IP
and value: '89.16.177.117'. In fact knowing how IP addresses are
constructed I could also record a similar token with value: 89.16.177
because it may be statistically significant that any of the 256
addresses that fall into this range are either spam or maybe even ham.
I don't know that but at least I'm giving the Bayes algorithm extra info
that it may find statistically significant. If it isn't then that token
will be deleted after a while anyway.
Similarly I would also create tokens with the context:
HEADER-RECEIVED-FROM-DOMAIN and the following values: -
tedwoodsports.dh.bytemark.co.uk
dh.bytemark.co.uk
bytemark.co.uk
co.uk
uk
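Generating those prefix and suffix tokens takes only a few lines. As a sketch (class and method names are mine): for an IP we emit the full address plus its first three octets, and for a host name we emit every trailing domain suffix.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the prefix/suffix expansion described above. Illustrative only.
class AddressTokens {
    // Full address plus its first three octets, e.g. "89.16.177".
    static List<String> ipTokens(String ip) {
        List<String> tokens = new ArrayList<>();
        tokens.add(ip);
        int lastDot = ip.lastIndexOf('.');
        if (lastDot > 0) {
            tokens.add(ip.substring(0, lastDot));
        }
        return tokens;
    }

    // Host name plus every trailing suffix down to the TLD.
    static List<String> domainTokens(String host) {
        List<String> tokens = new ArrayList<>();
        String rest = host;
        while (true) {
            tokens.add(rest);
            int dot = rest.indexOf('.');
            if (dot < 0) break;
            rest = rest.substring(dot + 1);
        }
        return tokens;
    }
}
```

Run on the example above, domainTokens() reproduces exactly the five values listed.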
>> I'm keen to capture phrases (ie. capturing two or more sequential
>> words) as I've heard they improve detection at the expense of a
>> larger token database.
>
> Any pointers?
>
That's pretty straightforward actually. Suppose you have the sentence
"Mary had a little lamb"; if you were capturing a phrase size of 2 you
would then generate the following token values in addition to the
single-word tokens: -
Maryhad
hada
alittle
littlelamb
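In code, phrase capture at size 2 is just a pass over adjacent word pairs (a sketch; the single-word tokens would be emitted separately):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of phrase capture: alongside the single-word tokens, emit the
// concatenation of each pair of adjacent words (phrase size 2).
class PhraseTokens {
    static List<String> bigrams(String sentence) {
        String[] words = sentence.split("\\s+");
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i + 1 < words.length; i++) {
            tokens.add(words[i] + words[i + 1]);
        }
        return tokens;
    }
}
```

Larger phrase sizes generalise the loop in the obvious way, at the cost of a combinatorially bigger token database.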
I recommend you read Paul Graham's 'Better Bayesian Filtering' [2]
(especially the bit titled 'Tokens'). It's fascinating stuff... or
maybe I'm getting too old and geeky :-)
>> Image info needs extracting too. So things like the width, height,
>> bit depth, type of encoding, Exif data and any tags should all be
>> captured.
>
> ...what would you use to extract image info?
I haven't used any graphics libraries recently but a quick scan suggests
'Commons Sanselan' [3] which happily is an Apache project now. When it
comes to extracting meta data from MS Documents I think Apache Poi [4]
is still a good choice.
David.
[1] http://spamprobe.sourceforge.net/
[2] http://www.paulgraham.com/better.html
[3] http://commons.apache.org/imaging/
[4] http://poi.apache.org/
Re: Bayesian Analysis for v3
Posted by Josip Almasi <jo...@vrspace.org>.
David Legg wrote:
> Hi Josip,
>
> Thanks for your comments.
>
> On 24/10/12 15:42, Josip Almasi wrote:
>>
>> I think I'll wait till it works with java 7. (workaround didn't work for me)
>
> I didn't know that. I'm OK with Java 6 for the moment as that is the default with Ubuntu 12.04. Still not quite comfortable with this IcedTea business though... I prefer 100% Java beans :-)
Well, the new JAXB broke more applications. Right now I can't remember exactly which ones, but I had to go back to JDK 6.
>>> So my first plan is to make the tokenizer more intelligent. It should carefully extract far more meta-data from the email.
>>
>> I wrote some mail-parsing code; it parses plain text and HTML and ignores other MIME types. For the others, I guess only the headers should be taken into account.
>> Malformed MIMEs are a real issue there. So I used heuristics to avoid them - the number of tokens and the size of tokens.
>> Also, it's better to ignore numbers, or use them as delimiters.
>> Of course, all message parts need to be processed. That's not cheap, and should be limited by a maximum allowed time and/or number of tokens.
>
> That's very interesting. Did you use the Mime4J library to do the heavy lifting or did you parse all the message yourself?
I used javax.mail, starting from a good mail-parsing example that's included.
I parsed the HTML with javax.swing.text.html.HTMLEditorKit.
It's for my mail archiver, which doesn't (yet) have anything to do with JAMES:
http://sf.net/projects/mar
So I did sort of the opposite of what antispam is intended to do: I captured only the 'good' keywords.
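For anyone curious, pulling the text out of HTML with the JDK's built-in parser looks roughly like this. This is a sketch along the lines Josip describes, not his actual archiver code: ParserDelegator calls handleText() for each run of character data, skipping the markup.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Sketch: extract the visible text from an HTML string using the JDK's
// own (Swing) HTML parser. Illustrative only.
class HtmlText {
    static String extract(String html) {
        StringBuilder sb = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // Called once per run of character data between tags.
                sb.append(data).append(' ');
            }
        };
        try {
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            // A StringReader won't actually throw; the parser's signature requires this.
        }
        return sb.toString().trim();
    }
}
```

One nice property of this parser for spam work is that it is fairly tolerant of broken markup, though as noted below it can still be blown up by sufficiently pathological input.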
> That's a good point about malformed MIMEs. Even with the relatively small number of spams I've collected I noticed a number of deviant practices.
Tell me about it; one even managed to produce a StackOverflowError in the HTML parser :>
> Not so sure about ignoring numbers though. Certainly, need to capture IP addresses, HTML and CSS colour settings and also domain names. I can see there will be a lot of tweaking involved.
Ah, CSS, I forgot about it completely. True, it has to be analyzed.
Uh, HTML... right, for antispam purposes, the tags need to be saved too.
The catch with numbers is, I received some CSV files containing database table dumps, hundreds of thousands of lines, each containing unique codes.
And of course many, many smaller ones, with various server logs etc.
They're best left alone.
IP and domain names, I don't think so.
Suppose you use the dot as a delimiter. Then each byte of the IP address becomes a token and gets its own weight. Much the same with domains.
Bayes should take care of the rest.
IP addresses are relatively rare in mails; domains are much more important.
Now, should we tokenize www.spammer.com and then weight www, spammer, and com, or should we store the domain as it is?
I think: tokenize.
It's just a bit more processing, but possibly much less storage:
- one "www" and one "com" stored instead of a zillion
- two dots fewer
- "spammer" is just another keyword stored and weighted, which may also occur in other mails containing no www.spammer.com domain at all
(this is all about message content of course, headers should not be tokenized)
Anyway, here's my delimiter list:
" ,./<>?`~!@#$%^&*()_+=-{}|[]\\;':\"\r\n\t1234567890"
Though numbers should probably be excluded:)
Watching parsing time and keyword number should eliminate problems with numbers.
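That delimiter list drops straight into a regex-based splitter; because the digits are in the list, numbers act as delimiters rather than becoming tokens themselves. A sketch (the class is mine, not the archiver's code):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Sketch of splitting on the delimiter list above. Digits are delimiters,
// so numbers never become tokens. Illustrative only.
class DelimiterTokenizer {
    private static final String DELIMITERS =
            " ,./<>?`~!@#$%^&*()_+=-{}|[]\\;':\"\r\n\t1234567890";
    // Pattern.quote escapes the whole list so it can sit inside a character class.
    private static final Pattern SPLITTER =
            Pattern.compile("[" + Pattern.quote(DELIMITERS) + "]+");

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : SPLITTER.split(text)) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }
}
```

Note how a run of digits such as "100" simply disappears between two delimiters, which is the behaviour argued for above.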
> I'm keen to capture phrases (ie. capturing two or more sequential words) as I've heard they improve detection at the expense of a larger token database.
Any pointers?
I don't know... it's quite complicated.
Though some lexical comparison might make sense. I wrote some examples here, but that got a 7.1 spam score and was returned to me :)
> Image info needs extracting too. So things like the width, height, bit depth, type of encoding, Exif data and any tags should all be captured.
> I quite often get large (several megabyte) emails from China containing pictures of products for me and the
> current James setup gives up with messages of that size. Or rather it creates thousands of random tokens full of base64 segments!
That's interesting; I don't get those. At least not as single-part messages. So Bayes probably picked up other keywords from the text/html part, and the headers.
So I think that's too much effort for small gain.
Anyway, what would you use to extract image info?
Regards...
Re: Bayesian Analysis for v3
Posted by David Legg <da...@searchevent.co.uk>.
Hi Josip,
Thanks for your comments.
On 24/10/12 15:42, Josip Almasi wrote:
>
> I think I'll wait till it works with java 7. (workaround didn't work
> for me)
I didn't know that. I'm OK with Java 6 for the moment as that is the
default with Ubuntu 12.04. Still not quite comfortable with this IcedTea
business though... I prefer 100% Java beans :-)
>> So my first plan is to make the tokenizer more intelligent. It
>> should carefully extract far more meta-data from the email.
>
> Wrote some mail parsing code, parses plain text and html, ignores
> other MIME types. For others, I guess only headers should be taken
> into account.
> Malformed MIMEs are real issue there. So I used heuristics to avoid
> them - number of tokens and size of tokens.
> Also, better ignore numbers, or use them as delimiters.
> Of course, all message parts need to be processed. That's not cheap,
> and should be limited, by max allowed time and/or number of tokens.
That's very interesting. Did you use the Mime4J library to do the heavy
lifting or did you parse all the message yourself?
That's a good point about malformed MIMEs. Even with the relatively
small number of spams I've collected I noticed a number of deviant
practices.
Not so sure about ignoring numbers though. Certainly, we need to capture
IP addresses, HTML and CSS colour settings, and also domain names. I can
see there will be a lot of tweaking involved.
I'm keen to capture phrases (i.e. capturing two or more sequential words)
as I've heard they improve detection at the expense of a larger token
database.
Image info needs extracting too. So things like the width, height, bit
depth, type of encoding, Exif data and any tags should all be captured.
I quite often get large (several megabyte) emails from China containing
pictures of products for me and the current James setup gives up with
messages of that size. Or rather it creates thousands of random tokens
full of base64 segments!
>
>> I worry how big the spam folder may get if I'm not deleting spam
>> messages.
>
> Well, I'm not deleting any spam:) You never know when you may need some;)
> Right now I have 143286 unread in my junk folder, total is 250k+, all
> correctly marked as 100% spam, 850MB.
>
I'm envious... erm.... I think! No seriously, that's got to be useful
to you someday. Maybe I should start collecting them instead of
deleting them too. I wonder how many of those are addressed to
'johnsmithsvt' :-)
Happy tokenizing!
David.
Re: Bayesian Analysis for v3
Posted by Josip Almasi <jo...@vrspace.org>.
Hi,
David Legg wrote:
> Hi all,
>
> It's been a long time since I frequented this list!
>
> After many years of faithful service I'm upgrading my server and thought I'd check to see what's happening with James. I'm pleased to see v3 is beginning to emerge and I'll be happy to take it for a spin.
Same here. Though I think I'll wait till it works with Java 7. (The workaround didn't work for me.)
> I see nothing much has changed with the Bayesian analysis mailet. It has performed very well for me and I'd definitely recommend it to people. However, I've just taken a look at the code for the first time and I think I'd like to have a go at improving it,
> especially as IMAP is now a possibility.
>
> I have a couple of ideas I'd like to try and I thought I'd air them here in case anyone has a brighter idea or some advice; thanks.
>
> As it stands, the current Bayesian filter has a relatively simplistic tokenizer. It literally seems to break the email into tokens with little regard to whether that bit of text is a mime boundary, base64, image, document or header etc. My spam and ham
> database is filled with millions of random looking chunks of text mainly from base64 encoded images! So my first plan is to make the tokenizer more intelligent. It should carefully extract far more meta-data from the email.
I might help you with that.
I wrote some mail-parsing code; it parses plain text and HTML and ignores other MIME types. For the others, I guess only the headers should be taken into account.
Malformed MIMEs are a real issue there. So I used heuristics to avoid them - the number of tokens and the size of tokens.
Also, it's better to ignore numbers, or use them as delimiters.
Of course, all message parts need to be processed. That's not cheap, and should be limited by a maximum allowed time and/or number of tokens.
> I'm not the first to think of this of course. Paul Graham originally wrote 'A Plan for Spam' [1] back in 2002 and then updated it with 'Better Bayesian Filtering' [2] in 2003. This spawned several projects and products. The more feature complete version
> is SpamProbe [3] by Brian Burton but a Java version exists with a project called jASEN [4]. This latter project has been quiet for a few years and was forked into a proprietary product as well.
>
> I'm quite interested in the fact that James 3 supports IMAP. I think this may make it easier and more efficient for users to maintain their own spam folder. Currently users have to send any spam (or ham) they receive to an address such as spam@xxx.yyy
> (or no-spam@xxx.yyy) and if they forget to send it as an attachment they risk poisoning the spam corpus. Think how much easier it would be to simply move an email from one of your email folders to a special 'spam' folder. Also think how much easier it
> would be to browse the spam folder looking for mis-classified emails and drag them back to the correct folder. Currently, I delete emails classified as spam and if someone wants it back I have to go rooting about in MySQL's binary logs!
Right!
> I worry how big the spam folder may get if I'm not deleting spam messages. I may have to automatically expire spam messages that get to a certain age. Or it may be that a small amount of fastfailing reduces the spam intake to manageable amounts.
Well, I'm not deleting any spam :) You never know when you may need some ;)
Right now I have 143286 unread messages in my junk folder; the total is 250k+, all correctly marked as 100% spam, 850MB.
Regards...