Posted to users@spamassassin.apache.org by Murty Rompalli <mu...@solar.murty.net> on 2005/01/01 04:52:59 UTC
Rule based on English words
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Hi
Any ideas on how to implement this are appreciated:
Frequency Analysis of English Vocabulary and Grammar: Based on
the LOB Corpus by Stig Johansson and Knut Hofland (OUP, 1989, ISBN
0-19-8242212-2) gives the top eighteen words and their frequencies
as:
1. the 68315
2. of 35716
3. and 27856
4. to 26760
5. a 22744
6. in 21108
7. that 11188
8. is 10978
9. was 10499
10. it 10010
11. for 9299
12. he 8776
13. as 7337
14. with 7197
15. be 7186
16. on 7027
17. I 6696
18. his 6266
If the body contains an http:, ftp: or https: link, I want to test it
further; otherwise, the test is skipped. The test is as follows:
Check each paragraph that does not contain any of the above 18 words
(paragraphs are separated by \n).
1. For each paragraph without common English words, assign a score.
2. For each paragraph containing words with 0-9, ' or " (anywhere), or
with : or ~ (in the middle), assign a score based on the number of matches.
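To make the idea concrete, here is a rough sketch of the check in Python. It is only an illustration of the logic, not a SpamAssassin rule or plugin. The names COMMON_WORDS, is_odd_token and analyse_body are made up for this example, and it assumes that step 2 is applied only to the paragraphs that already failed the common-word check. It returns raw counts and leaves the actual scores to whoever writes the rule.

```python
import re
import string

# The 18 most frequent words from the list above, lowercased.
COMMON_WORDS = frozenset(
    "the of and to a in that is was it for he as with be on i his".split()
)

# Gate: only bodies with an http:, https: or ftp: link are examined.
LINK_RE = re.compile(r"\b(?:https?|ftp):", re.IGNORECASE)


def is_odd_token(tok):
    """True if tok has a digit or quote anywhere, or ':' / '~' in the middle."""
    if any(c in "0123456789'\"" for c in tok):
        return True
    # "Middle" is taken to mean: not the first or last character (assumption).
    return any(c in ":~" for c in tok[1:-1])


def analyse_body(body):
    """Return (paragraphs_without_common_words, odd_tokens_in_those_paragraphs)."""
    if not LINK_RE.search(body):
        return (0, 0)  # no link, so the test is skipped
    plain_paras = 0
    odd_tokens = 0
    for para in body.split("\n"):  # paragraphs are separated by \n
        tokens = para.split()
        if not tokens:
            continue  # ignore blank lines
        # Strip surrounding punctuation so "the," still counts as "the".
        words = {t.strip(string.punctuation).lower() for t in tokens}
        if words & COMMON_WORDS:
            continue  # paragraph looks like ordinary English
        plain_paras += 1
        odd_tokens += sum(1 for t in tokens if is_odd_token(t))
    return (plain_paras, odd_tokens)
```

A caller could then turn the two counts into a score with whatever weights seem sensible, for example a fixed amount per paragraph plus a smaller amount per odd token. Those weights would need tuning against real mail.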
Thanks
Murty Rompalli
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.3 (GNU/Linux)
iD8DBQFB1h6bqbgVhXQ+7mURAtafAKC++FtF6OZIkHC2hVD90509VTgFVwCfZPSw
wVqnkz5XYQOG8ZBGa8Pvow4=
=oON4
-----END PGP SIGNATURE-----