Posted to users@spamassassin.apache.org by Justin Mason <jm...@jmason.org> on 2007/01/16 11:21:14 UTC

Re: getting Bayes token data from spamassassin

Michael Parker writes:
> Stuart Robinson wrote:
> > Hello, all.
> > 
> >> On Mon, Jan 15, 2007 at 01:54:07AM -0800, Stuart Robinson wrote:
> >>> I've searched around a bit, both on gmane and Google, but I haven't
> >>> found much more information regarding your two points. What IS
> >>> stored in the token field of the table bayes_token? And how is the
> >>> SHA1 hash involved?
> >> A SHA1 hash is taken of the original token value, and the bottom 40
> >> bits are used as the token from then-on.  There is a plugin call
> >> which can be used to store raw token -> hash value data, but
> >> otherwise the raw token information is lost after the message is
> >> processed.
> > 
> > Where could I find more information about the plugin call that allows
> > me to do this? 
> 
> perldoc Mail::SpamAssassin::Plugin

In particular:

http://spamassassin.apache.org/full/3.1.x/doc/Mail_SpamAssassin_Plugin.html#item_bayes_scan
http://spamassassin.apache.org/full/3.1.x/doc/Mail_SpamAssassin_Plugin.html#item_bayes_learn
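
The hashing step described in the quoted text above can be illustrated with a short sketch. It is written in Python purely for illustration (SpamAssassin itself is Perl), and it assumes that "bottom 40 bits" means the last five bytes of the 20-byte SHA1 digest. The function name token_hash is hypothetical and is not part of any SpamAssassin API.

```python
import hashlib


def token_hash(raw_token: str) -> bytes:
    """Return a 40-bit (5-byte) key derived from a raw token.

    Assumption: "bottom 40 bits" is taken to mean the final five bytes
    of the SHA1 digest of the token text.
    """
    digest = hashlib.sha1(raw_token.encode("utf-8")).digest()  # 20 bytes
    return digest[-5:]  # keep only the last 5 bytes = 40 bits
```

Because only 40 bits of a one-way hash are kept, the raw token cannot be recovered from the stored value, which is why a separate hash -> raw token mapping has to be recorded at learn time.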

> You should also search the dev list from a couple of years ago at least.
> Lots of discussion about the change and why it was done including, if
> memory serves me correctly, a proof of concept plugin to save off the
> token values.

by the way, a nice, working plugin that does this would be quite useful on
the CustomPlugins wiki page, or contributed as an optional plugin...

--j.

Re: getting Bayes token data from spamassassin

Posted by Jonas Eckerman <jo...@frukt.org>.
Jonas Eckerman wrote:

> I do not consider my plugin "nice" since it uses DBI in such an unoptimized way.

I did optimize it slightly yesterday, so maybe I do consider it almost nice now. :-)

> It really should use a prepared statement

Now it does this.

> It probably should use the DELAYED keyword when used with MySQL and MyISAM tables.

And optionally this as well.

> It should be made faster using INSERT with fallback to UPDATE

But not this.

It's good enough for me now (except for some cleanup), so I'm not going to spend any more time on optimizing it myself, but suggestions from others are not unwelcome.

Regards
/Jonas
-- 
Jonas Eckerman, FSDB & Fruktträdet
http://whatever.frukt.org/
http://www.fsdb.org/
http://www.frukt.org/


Re: getting Bayes token data from spamassassin

Posted by Jonas Eckerman <jo...@frukt.org>.
Jonas Eckerman wrote:
> Justin Mason wrote:
>> by the way, a nice, working plugin that does this would be quite useful

> Since it was so straight-forward I made a small plugin that collects the raw tokens in a SQL table.

An extra note:

I do not consider my plugin "nice" since it uses DBI in such an unoptimized way. I'm not very good at database programming, and this was a quick hack.

It really should use a prepared statement since it will perform the same operation a number of times for every learnt message. It probably should use the DELAYED keyword when used with MySQL and MyISAM tables. It should be made faster using INSERT with fallback to UPDATE (for the atime) rather than REPLACE INTO.
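
As a rough illustration of the "INSERT with fallback to UPDATE" idea, here is a minimal sketch in Python with SQLite rather than Perl DBI. It is not the code from CollectTokens.pm. The table layout (token as primary key, plus rawtoken and atime columns) and the function names are assumptions made for the example. The SQL strings are parameterized and reused for every token, which plays the role a prepared statement would play in DBI.

```python
import sqlite3
import time

# Parameterized statements, reused for every token of a learnt message.
INSERT_SQL = "INSERT INTO bayes_rawtoken (token, rawtoken, atime) VALUES (?, ?, ?)"
UPDATE_SQL = "UPDATE bayes_rawtoken SET atime = ? WHERE token = ?"


def create_table(conn):
    # Hypothetical layout: the hashed token is the primary key.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS bayes_rawtoken ("
        " token BLOB PRIMARY KEY,"
        " rawtoken TEXT NOT NULL,"
        " atime INTEGER NOT NULL)"
    )


def store_raw_tokens(conn, tokens, now=None):
    """Store {hashed_token: raw_token} pairs, refreshing atime for known tokens."""
    now = int(time.time()) if now is None else now
    cur = conn.cursor()
    for hashed, raw in tokens.items():
        try:
            # Try the plain INSERT first.
            cur.execute(INSERT_SQL, (hashed, raw, now))
        except sqlite3.IntegrityError:
            # The token already exists: only touch its access time.
            cur.execute(UPDATE_SQL, (now, hashed))
    conn.commit()
```

Unlike REPLACE INTO, this never deletes and re-creates an existing row; a token that is already known only gets its atime updated.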

I might do those fixes. Or maybe you'll do them.

Regards
/Jonas

-- 
Jonas Eckerman, FSDB & Fruktträdet
http://whatever.frukt.org/
http://www.fsdb.org/
http://www.frukt.org/


Re: getting Bayes token data from spamassassin

Posted by Michael Parker <pa...@pobox.com>.
Jonas Eckerman wrote:
> Justin Mason wrote:
>> http://spamassassin.apache.org/full/3.1.x/doc/Mail_SpamAssassin_Plugin.html#item_bayes_learn
> 
> Thanks!
> 
>> by the way, a nice, working plugin that does this would be quite useful
> 
> Since it was so straight-forward I made a small plugin that collects the raw tokens in a SQL table.
> 

Very nice, that's pretty much what I envisioned when I created the plugin
hooks and very similar to my original proof of concept.

If you wanted to reduce the insert/update time you could also do
something like this:
http://jroller.com/page/dschneller?entry=mysql_replication_using_blackhole_engine


Once you have it like you want it, I suggest posting it to the
CustomPlugins wiki page so others can easily find it.

Michael

Re: getting Bayes token data from spamassassin

Posted by Jonas Eckerman <jo...@frukt.org>.
Justin Mason wrote:
> http://spamassassin.apache.org/full/3.1.x/doc/Mail_SpamAssassin_Plugin.html#item_bayes_learn

Thanks!

> by the way, a nice, working plugin that does this would be quite useful

Since it was so straight-forward I made a small plugin that collects the raw tokens in a SQL table.

I've only been using it for about an hour, so there may well be problems with it. It ought to work though :-)
I've only tested it with MySQL, but I think it should work with SQLite as well without modification, and it should be trivial to modify for other SQL servers.

If anyone wants to test it, it's called CollectTokens.pm and is available at <http://whatever.frukt.org/spamassassin.text.shtml>. Please tell me if you find any problems.

What to actually do with the collected data is up to you, but here are two example queries:

Top 10 ham tokens:
SELECT bayes_token.ham_count,bayes_rawtoken.rawtoken 
  FROM bayes_rawtoken,bayes_token 
  WHERE bayes_rawtoken.token=bayes_token.token
  ORDER BY bayes_token.ham_count DESC LIMIT 10;

Top 10 spam tokens:
SELECT bayes_token.spam_count,bayes_rawtoken.rawtoken 
  FROM bayes_rawtoken,bayes_token 
  WHERE bayes_rawtoken.token=bayes_token.token
  ORDER BY bayes_token.spam_count DESC LIMIT 10;

Not sure that this is useful for anything at all, but curiosity is part of human nature. :-)
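
For anyone who wants to try the two queries without a live Bayes database, here is a small self-contained sketch using Python and an in-memory SQLite database. The table definitions are simplified assumptions that only contain the columns the queries touch, the sample rows are made up, and the helper names top_tokens and demo_connection are hypothetical.

```python
import sqlite3

# Same join as the two queries above, with the count column filled in.
TOP_TOKENS_SQL = """
SELECT bayes_token.{column}, bayes_rawtoken.rawtoken
  FROM bayes_rawtoken, bayes_token
  WHERE bayes_rawtoken.token = bayes_token.token
  ORDER BY bayes_token.{column} DESC LIMIT ?
"""


def top_tokens(conn, kind, limit=10):
    """Return (count, rawtoken) rows for the top 'ham' or 'spam' tokens."""
    if kind not in ("ham", "spam"):
        raise ValueError("kind must be 'ham' or 'spam'")
    sql = TOP_TOKENS_SQL.format(column=kind + "_count")
    return conn.execute(sql, (limit,)).fetchall()


def demo_connection():
    """Build an in-memory database with simplified, made-up sample data."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE bayes_token (
            token BLOB PRIMARY KEY, spam_count INTEGER, ham_count INTEGER);
        CREATE TABLE bayes_rawtoken (
            token BLOB PRIMARY KEY, rawtoken TEXT);
        """
    )
    conn.executemany(
        "INSERT INTO bayes_token VALUES (?, ?, ?)",
        [(b"aaaaa", 40, 0), (b"bbbbb", 1, 25), (b"ccccc", 7, 3)],
    )
    conn.executemany(
        "INSERT INTO bayes_rawtoken VALUES (?, ?)",
        [(b"aaaaa", "viagra"), (b"bbbbb", "meeting"), (b"ccccc", "free")],
    )
    return conn
```

Calling top_tokens(demo_connection(), "spam") runs the spam query against the sample rows; passing "ham" runs the ham query instead.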

Regards
/Jonas

-- 
Jonas Eckerman, FSDB & Fruktträdet
http://whatever.frukt.org/
http://www.fsdb.org/
http://www.frukt.org/


Re: getting Bayes token data from spamassassin

Posted by Theo Van Dinter <fe...@apache.org>.
On Tue, Jan 16, 2007 at 02:02:01PM -0800, Stuart Robinson wrote:
> Couldn't the raw tokens just be kept in the same database by adding an
> additional column to the table bayes_token that isn't indexed? That
> wouldn't affect performance too much, would it?

Besides requiring a new data layout for the DBM files, it would be space-wise
inefficient.

This issue was discussed to death a few years ago.  If you want this kind of
thing on your specific server, the plugin call allows you to write whatever
you want to deal with it.  I'd probably keep a new table w/ hash->raw token
mappings, or some kind of DBM.
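
A hash -> raw token mapping kept in "some kind of DBM" could look roughly like the sketch below. It uses Python's standard dbm module only to illustrate the idea; a real SpamAssassin plugin would be written in Perl, and the function names and file path here are hypothetical.

```python
import dbm
import os
import tempfile


def remember_tokens(path, tokens):
    """Record {hashed_token (bytes): raw_token (str)} pairs in a DBM file."""
    with dbm.open(path, "c") as db:  # "c": create the file if it is missing
        for hashed, raw in tokens.items():
            db[hashed] = raw.encode("utf-8")


def lookup_raw(path, hashed):
    """Return the raw token stored for a hash, or None if it is unknown."""
    with dbm.open(path, "r") as db:
        value = db.get(hashed)
    return None if value is None else value.decode("utf-8")


if __name__ == "__main__":
    # Small demonstration in a throwaway directory.
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "rawtokens")
        remember_tokens(path, {b"\x01\x02\x03\x04\x05": "example"})
        print(lookup_raw(path, b"\x01\x02\x03\x04\x05"))
```

Keeping the mapping in its own file leaves the existing Bayes storage layout untouched, which avoids the new data layout mentioned above.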

-- 
Randomly Selected Tagline:
"Do not meddle in the affairs of wizards, for you are crunchy and good
 with ketchup."                  - Unknown

Re: getting Bayes token data from spamassassin

Posted by Stuart Robinson <st...@zapata.org>.
> On Tue, Jan 16, 2007 at 10:21:14AM +0000, Justin Mason wrote:
> > by the way, a nice, working plugin that does this would be quite useful on
> > the CustomPlugins wiki page, or contributed as an optional plugin...
> 
> The plugin itself is pretty trivial -- the question is: what to do with
> the token information?  Should it be sent out to a flat file, kept in
> a DBM, etc?  That's where the non-trivial stuff happens.

Couldn't the raw tokens just be kept in the same database by adding an
additional column to the table bayes_token that isn't indexed? That
wouldn't affect performance too much, would it?

+----------------------------------------+
| Stuart Robinson                        |
| Email: stuart at zapata dot org        |
| Homepage: http://www.zapata.org/stuart |
+----------------------------------------+


Re: getting Bayes token data from spamassassin

Posted by Theo Van Dinter <fe...@apache.org>.
On Tue, Jan 16, 2007 at 10:21:14AM +0000, Justin Mason wrote:
> by the way, a nice, working plugin that does this would be quite useful on
> the CustomPlugins wiki page, or contributed as an optional plugin...

The plugin itself is pretty trivial -- the question is: what to do with
the token information?  Should it be sent out to a flat file, kept in
a DBM, etc?  That's where the non-trivial stuff happens.

-- 
Randomly Selected Tagline:
"... then you'll excuse me, but I'm in the middle of fifteen things, all of
 them annoying."
         - Ivanova, Babylon 5 (Midnight on the Firing Line)

Re: getting Bayes token data from spamassassin

Posted by Stuart Robinson <st...@zapata.org>.
Thanks. Once I have this all figured out, I will write up something and
put it on my homepage and post a link to it here.

> > >> A SHA1 hash is taken of the original token value, and the bottom 40
> > >> bits are used as the token from then-on.  There is a plugin call
> > >> which can be used to store raw token -> hash value data, but
> > >> otherwise the raw token information is lost after the message is
> > >> processed.
> > > 
> > > Where could I find more information about the plugin call that allows
> > > me to do this? 
> > 
> > perldoc Mail::SpamAssassin::Plugin
> 
> In particular:
> 
> http://spamassassin.apache.org/full/3.1.x/doc/Mail_SpamAssassin_Plugin.html#item_bayes_scan
> http://spamassassin.apache.org/full/3.1.x/doc/Mail_SpamAssassin_Plugin.html#item_bayes_learn
> 
> > You should also search the dev list from a couple of years ago at least.
> > Lots of discussion about the change and why it was done including, if
> > memory serves me correctly, a proof of concept plugin to save off the
> > token values.
> 
> by the way, a nice, working plugin that does this would be quite useful on
> the CustomPlugins wiki page, or contributed as an optional plugin...

+----------------------------------------+
| Stuart Robinson                        |
| Email: stuart at zapata dot org        |
| Homepage: http://www.zapata.org/stuart |
+----------------------------------------+