Posted to users@spamassassin.apache.org by Murty Rompalli <mu...@solar.murty.net> on 2005/01/01 04:52:59 UTC

Rule based on English words

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi
Any ideas on how to implement this are appreciated:

Frequency Analysis of English Vocabulary and Grammar: Based on
the LOB Corpus by Stig Johansson and Knut Hofland (OUP, 1989, ISBN
0-19-8242212-2) gives the top eighteen words and their frequencies
as:

      1.  the       68315
      2.  of        35716
      3.  and       27856
      4.  to        26760
      5.  a         22744
      6.  in        21108
      7.  that      11188
      8.  is        10978
      9.  was       10499
     10.  it        10010
     11.  for        9299
     12.  he         8776
     13.  as         7337
     14.  with       7197
     15.  be         7186
     16.  on         7027
     17.  I          6696
     18.  his        6266

If the body contains an http:, ftp: or https: link, I want to test it
further; otherwise, the test should be skipped. The test is as follows:

Check each paragraph that does not contain any of the above 18 words
(paragraphs are separated by \n).
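
To make the link check and the paragraph check concrete, here is a rough
sketch in Python (not SpamAssassin rule syntax). The names has_link and
paragraphs_without_common_words are hypothetical. It assumes that a "link"
means any occurrence of http:, https: or ftp:, and that a "word" is a run of
ASCII letters compared case-insensitively.

```python
import re

# The eighteen most frequent words from the list above, lowercased.
COMMON_WORDS = frozenset(
    "the of and to a in that is was it for he as with be on i his".split()
)

# Assumption: a link is any occurrence of one of these scheme prefixes.
LINK_RE = re.compile(r"\b(?:https?|ftp):", re.IGNORECASE)

# Assumption: a word is a run of ASCII letters.
WORD_RE = re.compile(r"[A-Za-z]+")


def has_link(body):
    """Return True if the body contains an http:, https: or ftp: link."""
    return LINK_RE.search(body) is not None


def paragraphs_without_common_words(body):
    """Return the non-empty paragraphs that contain none of the 18 words.

    The test is skipped (empty list) when the body has no link.
    """
    if not has_link(body):
        return []
    result = []
    for para in body.split("\n"):
        if not para.strip():
            continue  # ignore blank lines between paragraphs
        words = {w.lower() for w in WORD_RE.findall(para)}
        if not words & COMMON_WORDS:
            result.append(para)
    return result
```

One side effect of the letters-only word definition: a token such as "it's"
or a URL path segment "/a" yields the words "it" or "a", so that paragraph
counts as containing a common word. Whether that is desirable is an open
question.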

1. For each paragraph without common English words, assign a score.
2. For each paragraph containing words with a digit (0-9), ' or " anywhere
in the word, or : or ~ in the middle of the word, assign a score based on
the number of matches.
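
The two scoring steps above could be sketched roughly as follows, again in
Python and not as an actual SpamAssassin rule. The weights BASE_SCORE and
PER_MATCH_SCORE are made-up placeholders, and the function names are
hypothetical. This is one reading of the rules: step 2 is only counted for
paragraphs that already lack all 18 common words, tokens are split on
whitespace, and "middle" means at least one character on each side. The link
check is left out here to keep the sketch short.

```python
import re

# The eighteen most frequent words from the list above, lowercased.
COMMON_WORDS = frozenset(
    "the of and to a in that is was it for he as with be on i his".split()
)
WORD_RE = re.compile(r"[A-Za-z]+")

# Hypothetical weights; the post does not specify any numbers.
BASE_SCORE = 1.0
PER_MATCH_SCORE = 0.5

# A digit or quote anywhere in the token, or ':' / '~' with at least one
# non-space character on each side (the "middle" of the token).
SUSPICIOUS_RE = re.compile(r"""[0-9'"]|(?<=\S)[:~](?=\S)""")


def count_suspicious_tokens(para):
    """Count whitespace-separated tokens that match the step 2 pattern."""
    return sum(1 for tok in para.split() if SUSPICIOUS_RE.search(tok))


def score_paragraph(para):
    """Score one paragraph: 0 if blank or if it has a common word."""
    words = {w.lower() for w in WORD_RE.findall(para)}
    if not para.strip() or words & COMMON_WORDS:
        return 0.0
    return BASE_SCORE + PER_MATCH_SCORE * count_suspicious_tokens(para)


def score_body(body):
    """Sum the paragraph scores; paragraphs are separated by newlines."""
    return sum(score_paragraph(p) for p in body.split("\n"))
```

With these placeholder weights, a paragraph such as "Buy ch3ap pills now"
gets the base score plus one match. Note that a URL token like "http://..."
also counts as a match under this reading, because of the colon in the
middle.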

Thanks
Murty Rompalli

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.3 (GNU/Linux)

iD8DBQFB1h6bqbgVhXQ+7mURAtafAKC++FtF6OZIkHC2hVD90509VTgFVwCfZPSw
wVqnkz5XYQOG8ZBGa8Pvow4=
=oON4
-----END PGP SIGNATURE-----