Posted to users@spamassassin.apache.org by qian diao <qi...@hotmail.com> on 2013/10/24 23:02:26 UTC

spamassassin tokenization

Hi,
I am trying to use SpamAssassin's tokenization results as input to other machine learning methods, such as SVM. The output of "sa-learn --dump" gives token frequencies aggregated over all ham or all spam messages, not on a per-message basis.
The token counts I want would be in a format like the following:

Tokens    msg0    msg1    ...    msgM
token1    10      6       ...    0
......
tokenN    20      1       ...    2
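
To make the layout concrete, here is a minimal Python sketch that builds such a per-message count table. Note the assumptions: the regex tokenizer is only a naive stand-in and is not SpamAssassin's Bayes tokenizer, and the names tokenize and token_count_matrix are hypothetical, made up for this illustration.

```python
import re
from collections import Counter


def tokenize(text):
    # Naive stand-in tokenizer (assumption): lowercase alphanumeric runs.
    # SpamAssassin's real tokenizer behaves differently.
    return re.findall(r"[a-z0-9]+", text.lower())


def token_count_matrix(messages):
    # One Counter per message: token -> count within that message.
    counts = [Counter(tokenize(m)) for m in messages]
    # Vocabulary is the sorted union of all tokens seen in any message.
    vocab = sorted(set().union(*counts))
    # Each row maps a token to its counts in msg0..msgM (0 when absent).
    rows = {tok: [c[tok] for c in counts] for tok in vocab}
    return vocab, rows


def format_table(vocab, rows, n_messages):
    # Render the table with a header line like "Tokens msg0 msg1 ...".
    header = ["Tokens"] + ["msg%d" % i for i in range(n_messages)]
    lines = ["\t".join(header)]
    for tok in vocab:
        lines.append("\t".join([tok] + [str(n) for n in rows[tok]]))
    return "\n".join(lines)


if __name__ == "__main__":
    msgs = ["Buy cheap pills, buy now", "Meeting at noon"]
    vocab, rows = token_count_matrix(msgs)
    print(format_table(vocab, rows, len(msgs)))
```

Each column of the resulting table could then be passed as a feature vector to an external classifier such as an SVM.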

If per-message data is not available in the current design, is there any way to use SpamAssassin for the tokenization only, and then apply my own statistical model for the classification?
Thanks,
Qian