Posted to users@spamassassin.apache.org by Paul Reilly <pa...@tcd.ie> on 2005/03/14 17:46:06 UTC

bayesian tokens in text format?

Is it possible to dump the bayesian tokens in
human readable format still? It was quite useful
but since 3.0.x they seem to be base64 encoded or
some other way encoded. I couldn't see any sa-learn
option, or any FAQ entry about it.
Thanks
Paul


Re: bayesian tokens in text format?

Posted by Matt Kettler <mk...@evi-inc.com>.
AltGrendel wrote:

> Does this apply to Bayes/SQL too?


It should. AFAIK, the hashing is done by SA 3.0's bayes engine, so the 
kind of database used doesn't change the fact that tokens are hashed.


Re: bayesian tokens in text format?

Posted by Rick Beebe <ri...@yale.edu>.
>> In sa 3.0+ they are base-64 encodings of the SHA1 hash of the token. 
>> The hash is for all practical purposes not reversible.
>>
>>
> 
> Does this apply to Bayes/SQL too?

Yup.

mysql> select * from bayes_token limit 1;
+----+-------+------------+-----------+------------+
| id | token | spam_count | ham_count | atime      |
+----+-------+------------+-----------+------------+
|  3 | }C|Gá |       1983 |     15776 | 1111072469 |
+----+-------+------------+-----------+------------+

--Rick


Re: bayesian tokens in text format?

Posted by AltGrendel <al...@exit0.us>.
Matt Kettler wrote:

> At 11:46 AM 3/14/2005, Paul Reilly wrote:
>
>> Is it possible to dump the bayesian tokens in
>> human readable format still?
>
>
> No.
>
> In sa 3.0+ they are base-64 encodings of the SHA1 hash of the token. 
> The hash is for all practical purposes not reversible.
>
> This is done in part for privacy.. examining the text of a bayes DB 
> tells a lot about the email someone receives.. Examining the sha1 
> hashes tells you little about their email.
>
> It's also done for speed... SHA1 hashes are fixed size, which helps 
> optimize access to the database.
>
> However, it does have the drawback of making it difficult to inspect 
> the database manually for weirdness, but really, 9 times out of 10, 
> this inspection tends to lead to monkeying with the bayes DB more than 
> you should.
>
> Now, if you have a specific message that's a problem, running it 
> through spamassassin -D will show the tokens matched in text format. 
> So, if you have a specific problem, you can still use this to diagnose 
> what's going on.
>

Does this apply to Bayes/SQL too?

Re: bayesian tokens in text format?

Posted by Matt Kettler <mk...@evi-inc.com>.
At 11:46 AM 3/14/2005, Paul Reilly wrote:
>Is it possible to dump the bayesian tokens in
>human readable format still?

No.

In sa 3.0+ they are base-64 encodings of the SHA1 hash of the token. The 
hash is for all practical purposes not reversible.

This is done in part for privacy. Examining the text of a bayes DB tells a 
lot about the email someone receives; examining the SHA1 hashes tells you 
little about their email.

It's also done for speed: SHA1 hashes are fixed size, which helps 
optimize access to the database.

The drawback is that it becomes difficult to inspect the database manually 
for weirdness. But really, 9 times out of 10, this inspection tends to lead 
to monkeying with the bayes DB more than you should.

Now, if you have a specific message that's a problem, running it through 
spamassassin -D will show the tokens matched in text format. So, if you 
have a specific problem, you can still use this to diagnose what's going on.
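
To illustrate why the stored values cannot be read back, here is a minimal
sketch in Python (not SA's own Perl code). The function name hashed_token and
the choice to keep only the trailing 5 bytes of the digest are assumptions
made for illustration; this thread only states that tokens are stored as SHA1
hashes.

```python
import hashlib


def hashed_token(token: str, keep_bytes: int = 5) -> str:
    """Return a hex string for the trailing bytes of the token's SHA1 digest.

    keep_bytes=5 is an assumption for this sketch, not a value confirmed
    in this thread.
    """
    digest = hashlib.sha1(token.encode("utf-8")).digest()
    return digest[-keep_bytes:].hex()


if __name__ == "__main__":
    # Every token maps to a fixed-size value, whatever its original length.
    for word in ("viagra", "v1agra", "a-much-longer-token-than-the-others"):
        print(word, "->", hashed_token(word))
```

The output has the same width for every input, which is the fixed-size
property mentioned above, and nothing in it lets you recover the original
word without already knowing which word to try.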


Re: bayesian tokens in text format?

Posted by Matt Kettler <mk...@evi-inc.com>.
At 01:11 PM 3/14/2005, Michael Parker wrote:
>In general, no it's not possible to dump the bayesian tokens in a
>readable (well they are readable, it's just hard to read them :))
>format, unless you do a little work yourself.  It is possible to dump
>them by making use of the given plugin hooks that allow you to fetch
>the "raw" token value and match it to the SHA1 hash for the token.

True, however, just given a bayes DB in 3.0's normal format, you can't dump 
it in text format. The plugin would have to have been running while the 
bayes DB was created.


>The primary motivation for the change was indeed speed, and let me
>tell you it was a lot.  Privacy never really entered into the picture,
>although I suppose it is a nice side effect, except that with a plugin
>it's pretty easy to map the token values.

True. I guess I misrepresented a desirable-to-some side effect as a reason 
for implementation. Speed was the big motivator.

>Of course, I have to ask, how do you find the data "quite useful?"

It's "quite useful" as dumping the bayes db through sort and looking at the 
tokens helps you identify tokens to look for that may be in misclassified 
messages.

i.e., if I see an obfuscated Viagra variant with stats like "0 spam 1 ham 
0.000", I know to go dig around in my archives for a misclassified message 
containing that word and re-train it properly.
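
A rough sketch of that kind of inspection, in Python. It assumes a dump with
readable tokens (pre-3.0, or mapped back by other means) and a
whitespace-separated line layout of probability, spam count, ham count, atime
and token. That layout, the "non-token data" prefix for bookkeeping lines, and
the names TokenStat, parse_dump and ham_only are assumptions for illustration,
so check them against your own dump output.

```python
from typing import Iterable, List, NamedTuple


class TokenStat(NamedTuple):
    prob: float
    spam: int
    ham: int
    atime: int
    token: str


def parse_dump(lines: Iterable[str]) -> List[TokenStat]:
    """Parse dump lines in the assumed 'prob spam ham atime token' layout."""
    stats = []
    for line in lines:
        parts = line.split(None, 4)
        if len(parts) != 5:
            continue  # not a token line in the assumed layout
        token = parts[4].strip()
        if token.startswith("non-token data"):
            continue  # bookkeeping line (assumed prefix), not a real token
        try:
            stats.append(TokenStat(float(parts[0]), int(parts[1]),
                                   int(parts[2]), int(parts[3]), token))
        except ValueError:
            continue  # skip lines whose numeric columns do not parse
    return stats


def ham_only(stats: Iterable[TokenStat], max_ham: int = 1) -> List[TokenStat]:
    """Tokens never seen in spam and seen at most max_ham times in ham."""
    picked = [s for s in stats if s.spam == 0 and 0 < s.ham <= max_ham]
    return sorted(picked, key=lambda s: s.token)


SAMPLE = [
    "0.000          0          3          0  non-token data: bayes db version",
    "0.000          0          1 1111072469  v1agra",
    "0.987         12          0 1111072469  mortgage",
    "0.020          0         40 1111072469  meeting",
]

if __name__ == "__main__":
    for stat in ham_only(parse_dump(SAMPLE)):
        print(stat.token, stat.spam, stat.ham)
```

A token that turns up here is only a hint to go find and re-train the
misclassified message; it is not a reason to edit the database directly.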

However, as I said before, 9 times out of 10 doing this leads to people 
over-manipulating their bayes DB by deciding that a particular token "must 
be" spam or nonspam, and doing things like creating bogus messages to shift 
the training the way they want it. A lot of admins get really worried about 
one or two tokens that don't "look right"... Which is a bad thing.





RE: bayesian tokens in text format?

Posted by Ben Wylie <sa...@benwylie.co.uk>.
-----Original Message-----
> From: Michael Parker [mailto:parkerm@pobox.com] Sent: 14 March 2005 22:28
> > On Mon, Mar 14, 2005 at 10:23:37PM +0000, Paul Reilly wrote:
> > 
> > > Of course, I have to ask, how do you find the data "quite useful?"  I
> > 
> > It's useful to see what words/tokens are getting high scores.
> > The bayes database on one of my machines seems to be not
> > as accurate as the others, and results in msgs through that
> > machine are getting a negative bayes scoring. -1.7 etc
> > I wanted to see the tokens to see if I could see anything
> > unusual which might be causing this. But it's not a big issue.
> > 
>
> You don't need to see all of the tokens in the database to see this.
> There are several bayes based tags that can give you this sort of
> information on a per msg basis. perldoc Mail::SpamAssassin::Conf

Is there a way to get the bayes database to forget specified tokens?
I have just realised that some of the information that I have added to the
emails to indicate that they are spam, has not been removed before being
learnt. I'd like to manually tell it to forget these tokens. Is there any
way to do this?

Thanks
Ben



Re: bayesian tokens in text format?

Posted by Michael Parker <pa...@pobox.com>.
On Mon, Mar 14, 2005 at 10:23:37PM +0000, Paul Reilly wrote:
> 
> > Of course, I have to ask, how do you find the data "quite useful?"  I
> 
> It's useful to see what words/tokens are getting high scores.
> The bayes database on one of my machines seems to be not
> as accurate as the others, and results in msgs through that
> machine are getting a negative bayes scoring. -1.7 etc
> I wanted to see the tokens to see if I could see anything
> unusual which might be causing this. But it's not a big issue.
> 

You don't need to see all of the tokens in the database to see this.
There are several bayes based tags that can give you this sort of
information on a per msg basis. perldoc Mail::SpamAssassin::Conf

Michael

Re: bayesian tokens in text format?

Posted by Paul Reilly <pa...@tcd.ie>.
> Of course, I have to ask, how do you find the data "quite useful?"  I

It's useful to see what words/tokens are getting high scores.
The bayes database on one of my machines seems to be not
as accurate as the others, and as a result msgs through that
machine are getting a negative bayes score (-1.7, etc.).
I wanted to see the tokens to see if I could see anything
unusual which might be causing this. But it's not a big issue.

Thanks all,

Paul


Re: bayesian tokens in text format?

Posted by Michael Parker <pa...@pobox.com>.
On Mon, Mar 14, 2005 at 04:46:06PM +0000, Paul Reilly wrote:
> 
> Is it possible to dump the bayesian tokens in
> human readable format still? It was quite useful
> but since 3.0.x they seem to be base64 encoded or
> some other way encoded. I couldn't see any sa-learn
> option, or any FAQ entry about it.

To expand a bit on what Matt said.

In general, no it's not possible to dump the bayesian tokens in a
readable (well they are readable, it's just hard to read them :))
format, unless you do a little work yourself.  It is possible to dump
them by making use of the given plugin hooks that allow you to fetch
the "raw" token value and match it to the SHA1 hash for the token.

FYI, the values you can see, via a --dump or --backup, are actually
hex representations of the binary SHA1 data.
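
The idea of matching "raw" token values to their hashes can be sketched
outside of SA as follows. This is not the plugin hook itself but an offline
approximation in Python: hash a list of candidate words you already have and
look the dumped hex values up in that table. The names token_key,
build_reverse_map and resolve, and the use of the trailing 5 bytes of the
digest, are assumptions for illustration only.

```python
import hashlib
from typing import Dict, Iterable, Optional


def token_key(token: str, keep_bytes: int = 5) -> str:
    """Hex of the trailing bytes of the SHA1 digest (truncation is assumed)."""
    return hashlib.sha1(token.encode("utf-8")).digest()[-keep_bytes:].hex()


def build_reverse_map(candidates: Iterable[str]) -> Dict[str, str]:
    """Map hashed value -> raw token for every candidate we know about."""
    return {token_key(tok): tok for tok in candidates}


def resolve(dumped_hex: Iterable[str],
            reverse_map: Dict[str, str]) -> Dict[str, Optional[str]]:
    """Look up each dumped hex value; unknown hashes resolve to None."""
    return {h: reverse_map.get(h) for h in dumped_hex}


if __name__ == "__main__":
    known = build_reverse_map(["viagra", "meeting", "mortgage"])
    dumped = [token_key("meeting"), token_key("never-seen-before")]
    for hex_value, raw in resolve(dumped, known).items():
        print(hex_value, "->", raw if raw is not None else "(unknown)")
```

Only tokens that were recorded, or that you can guess, resolve to text. The
rest stay opaque.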

The primary motivation for the change was indeed speed, and let me
tell you it was a lot.  Privacy never really entered into the picture,
although I suppose it is a nice side effect, except that with a plugin
it's pretty easy to map the token values.

I know, the next thing you're going to ask is how to write a plugin
to do this. Well, that is an exercise for the reader.  I did a proof of
concept back when I added the plugin hooks, and may have sent it to
the mailing list so check the archives.  For all the juicy details
check out the comments in this bug:
http://bugzilla.spamassassin.org/show_bug.cgi?id=3331

Of course, I have to ask, how do you find the data "quite useful?"  I
asked on the mailing list several times for examples of how people
might use that data and nothing came along that was very compelling,
at least enough for me to pursue a better more integrated fix.

Michael