Posted to users@spamassassin.apache.org by qian diao <qi...@hotmail.com> on 2013/10/24 23:02:26 UTC
spamassassin tokenization
Hi,
I am trying to use SpamAssassin's tokenization results as input to other machine learning methods, such as SVM. The output of "sa-learn --dump" gives token frequencies aggregated over all ham or all spam messages, not on a per-message basis.
The token counts I want are in a format like the following:
Tokens msg0 msg1 ... msgM
token1 10 6 ... 0
......
tokenN 20 1 ... 2
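To make the layout concrete, here is a minimal Python sketch of how such a token-by-message count matrix could be built. The tokenizer below (`simple_tokenize`) is a hypothetical stand-in that just splits on whitespace; it is not SpamAssassin's Bayes tokenizer, and the function names are my own for illustration only.

```python
from collections import Counter
import re


def simple_tokenize(text):
    # Stand-in tokenizer (an assumption): lowercase, split on whitespace.
    # This is NOT SpamAssassin's tokenizer; swap in the real one if available.
    return re.findall(r"\S+", text.lower())


def token_count_matrix(messages, tokenize=simple_tokenize):
    # One Counter per message: token -> number of occurrences in that message.
    counts = [Counter(tokenize(m)) for m in messages]
    # Vocabulary is the sorted union of all tokens seen in any message.
    vocab = sorted(set().union(*counts))
    # One row per token, one column per message (0 when the token is absent).
    rows = [[c[t] for c in counts] for t in vocab]
    return vocab, rows


def format_matrix(vocab, rows, n_messages):
    # Render the matrix in the "Tokens msg0 msg1 ..." layout shown above.
    lines = ["Tokens " + " ".join(f"msg{i}" for i in range(n_messages))]
    for token, row in zip(vocab, rows):
        lines.append(token + " " + " ".join(str(n) for n in row))
    return "\n".join(lines)


if __name__ == "__main__":
    msgs = ["buy cheap pills buy", "meeting at noon", "cheap meeting"]
    vocab, rows = token_count_matrix(msgs)
    print(format_matrix(vocab, rows, len(msgs)))
```

Each row could then be transposed into a per-message feature vector for an SVM or a similar classifier.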
If per-message data is not available in the current design, is there any way to use SpamAssassin for the tokenization only, and then use my own statistical model for the classification?
Thanks,
Qian