Posted to users@spamassassin.apache.org by Georg Sauthoff <gs...@TechFak.Uni-Bielefeld.DE> on 2006/12/15 23:12:21 UTC

Specification of sa-learn --backup format

Hi,

I couldn't find the specification of the sa-learn --backup or sa-learn --dump
format in the documentation. I am mainly interested in the spam/ham
token strings and their observed frequencies (and the totals of
tokens/messages).

E.g.
t       1       0       1162501869      <some_10digit_hex>
                        ^ a)             ^ b)

a) I guess some time format?
b) the spam/ham token string? how encoded?

or
s       h       <some_long_hex>@sa_generated

Where can I find that specification/information?

Best regards
Georg Sauthoff

Re: How to check message size?

Posted by Duncan Hill <sa...@nacnud.force9.co.uk>.

On Wednesday 20 December 2006 06:11, Kosmaj wrote:
> Forgive me for spamming the list, but I just realized
> that if I just score -10 to long messages, SA will keep on
> applying other rules, and it will take a long time again.
> Therefore, what I need is a rule which will score -10 points
> and tell SA to stop processing all other rules.

The standard answer is don't scan mail that large if you don't expect it to be 
spammy.  Saves the entire overhead of running a pile of code against a mail 
that doesn't need to be scanned.

Re: How to check message size?

Posted by Matt Kettler <mk...@verizon.net>.

Kosmaj wrote:
> Forgive me for spamming the list, but I just realized
> that if I just score -10 to long messages, SA will keep on
> applying other rules, and it will take a long time again.
> Therefore, what I need is a rule which will score -10 points
> and tell SA to stop processing all other rules.

Rather than using a rule for this, spamc has a built-in feature to
bypass scanning of large messages. It defaults to 250k, but you can
change it with the -s parameter to spamc.

If you're interested in speed, you probably should be using the
spamc/spamd pair instead of the "spamassassin" script to invoke
SpamAssassin anyway.
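
As a rough illustration of the size bypass described above, here is a minimal Python sketch. It assumes spamc is installed and on the PATH, and it reads "250k" as 250 * 1024 bytes; the helper names should_scan and filter_message are hypothetical and not part of SpamAssassin.

```python
import subprocess

# Assumption: "250k" is taken to mean 250 * 1024 bytes.
DEFAULT_MAX_SIZE = 250 * 1024


def should_scan(raw_message: bytes, max_size: int = DEFAULT_MAX_SIZE) -> bool:
    """Return True if the message is small enough to be worth scanning."""
    return len(raw_message) <= max_size


def filter_message(raw_message: bytes, max_size: int = DEFAULT_MAX_SIZE) -> bytes:
    """Pass small messages through spamc; return large ones untouched.

    Hypothetical helper: the size check here avoids even starting spamc,
    while -s asks spamc itself to skip anything over the same limit.
    """
    if not should_scan(raw_message, max_size):
        # Too large: skip scanning entirely instead of scoring it down.
        return raw_message
    result = subprocess.run(
        ["spamc", "-s", str(max_size)],
        input=raw_message,
        capture_output=True,
        check=True,
    )
    return result.stdout
```

A caller such as a POP3 proxy could run every retrieved message through filter_message; oversized mail then costs only a length check rather than a full rule run.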

Re: How to check message size?

Posted by Kosmaj <ko...@yahoo.com>.

Forgive me for spamming the list, but I just realized
that if I just score -10 to long messages, SA will keep on
applying other rules, and it will take a long time again.
Therefore, what I need is a rule which will score -10 points
and tell SA to stop processing all other rules.

Thanks,
Kosmaj

--- Kosmaj <ko...@yahoo.com> wrote:
> I'm using JSpamAssassin, a POP3 proxy in Java
> (Win-2k/Outlook Express),
> and I have a problem: long messages (more than 1M bytes)
> arrive truncated. I think it's related to the 20 sec time-out
> which is hard-coded in Java. I have a fast Internet connection,
> at least 5 Mbit/sec, but most likely it takes time to apply all
> SA rules to such a long message.
> I'm planning to change the Java code but right now I don't
> have enough time, and instead I'd like to add a rule which
> will give minus 10 points to messages longer than a certain
> threshold. How can I do that?
> 
> Thanks a lot,
> Kosmaj

How to check message size?

Posted by Kosmaj <ko...@yahoo.com>.

I'm using JSpamAssassin, a POP3 proxy in Java
(Win-2k/Outlook Express),
and I have a problem: long messages (more than 1M bytes)
arrive truncated. I think it's related to the 20 sec time-out
which is hard-coded in Java. I have a fast Internet connection,
at least 5 Mbit/sec, but most likely it takes time to apply all
SA rules to such a long message.
I'm planning to change the Java code but right now I don't
have enough time, and instead I'd like to add a rule which
will give minus 10 points to messages longer than a certain
threshold. How can I do that?

Thanks a lot,
Kosmaj


Re: Specification of sa-learn --backup format

Posted by Theo Van Dinter <fe...@apache.org>.

On Fri, Dec 15, 2006 at 11:53:34PM +0000, Nigel Frankcom wrote:
> Cool - thanks Theo, is that in one of the FAQs/manuals somewhere?

I don't think so (or else I would have just pointed you there ;)),
but feel free to add it on the wiki. :)

-- 
Randomly Selected Tagline:
Backup aborted: Remove disk #92 and start over.

Re: Specification of sa-learn --backup format

Posted by Nigel Frankcom <ni...@blue-canoe.net>.

On Fri, 15 Dec 2006 17:26:50 -0500, Theo Van Dinter
<fe...@apache.org> wrote:

>On Fri, Dec 15, 2006 at 11:12:21PM +0100, Georg Sauthoff wrote:
>> t       1       0       1162501869      <some_10digit_hex>
>>                         ^ a)             ^ b)
>> 
>> a) I guess some time format?
>
>Yes, UNIX standard seconds since epoch (time_t).
>
>> b) the spam/ham token string? how encoded?
>
>partial sha1 hash -- really in binary, displayed in hex.
>
>so to put it all together:
>
>"t" for token.  number of times seen in spam.  number of times seen in ham.
>timestamp.  token encoded in hex.
>
>> s       h       <some_long_hex>@sa_generated
>
>This is the seen DB.  "s" for "seen", "h" for ham ("s" for spam), and the
>message id.
>
>> Where can I find that specification/information?
>
>I don't know if it's really documented other than in the source code.  It's
>not really meant to be edited, etc.

Cool - thanks Theo, is that in one of the FAQs/manuals somewhere?

Kind regards

Nigel

Re: Specification of sa-learn --backup format

Posted by Theo Van Dinter <fe...@apache.org>.

On Fri, Dec 15, 2006 at 11:12:21PM +0100, Georg Sauthoff wrote:
> t       1       0       1162501869      <some_10digit_hex>
>                         ^ a)             ^ b)
> 
> a) I guess some time format?

Yes, UNIX standard seconds since epoch (time_t).

> b) the spam/ham token string? how encoded?

partial sha1 hash -- really in binary, displayed in hex.

so to put it all together:

"t" for token.  number of times seen in spam.  number of times seen in ham.
timestamp.  token encoded in hex.

> s       h       <some_long_hex>@sa_generated

This is the seen DB.  "s" for "seen", "h" for ham ("s" for spam), and the
message id.

> Where can I find that specification/information?

I don't know if it's really documented other than in the source code.  It's
not really meant to be edited, etc.
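
To make the field layout above concrete, here is a minimal Python sketch of a line parser. It covers only the "t" and "s" records described in this message and returns None for anything else; the class and function names are hypothetical, and splitting on arbitrary whitespace is an assumption about the separator.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional, Union


@dataclass
class Token:
    spam_count: int      # number of times seen in spam
    ham_count: int       # number of times seen in ham
    last_seen: datetime  # timestamp, seconds since the UNIX epoch
    token_hex: str       # partial SHA1 hash, displayed in hex


@dataclass
class Seen:
    learned_as: str      # "ham" or "spam"
    message_id: str


def parse_backup_line(line: str) -> Optional[Union[Token, Seen]]:
    """Parse one line of sa-learn --backup output (hypothetical helper)."""
    fields = line.split(None, 1)
    if not fields:
        return None  # blank line
    kind = fields[0]
    if kind == "t":
        _, spam, ham, atime, token = line.split(None, 4)
        when = datetime.fromtimestamp(int(atime), tz=timezone.utc)
        return Token(int(spam), int(ham), when, token.strip())
    if kind == "s":
        _, flag, message_id = line.split(None, 2)
        learned_as = {"h": "ham", "s": "spam"}.get(flag, flag)
        return Seen(learned_as, message_id.strip())
    return None  # record types not described here are ignored
```

Feeding each line of a backup file through parse_backup_line and summing spam_count and ham_count per token gives the observed frequencies asked about in the original question.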

-- 
Randomly Selected Tagline:
"Linux without source is like coffee without caffeine."   - Brian Moore