Posted to dev@spamassassin.apache.org by Justin Mason <jm...@jmason.org> on 2005/06/30 03:48:02 UTC

NOTICE: 3.1.0 rescoring mass-checks

OK, if you're planning to send us mass-check logs for the 3.1.0
rescoring, now's the time!

http://wiki.apache.org/spamassassin/RescoreDetails has all the
details.

cheers!

--j.

Re: NOTICE: 3.1.0 rescoring mass-checks

Posted by Rod Begbie <ro...@gmail.com>.
Should I be concerned about the swath of these messages I'm seeing:

bayes: cannot open bayes databases
/Users/rod/sa/Mail-SpamAssassin-3.1.0/masses/spamassassin/bayes_* R/W:
lock failed: Interrupted system call
bayes: cannot open bayes databases
/Users/rod/sa/Mail-SpamAssassin-3.1.0/masses/spamassassin/bayes_* R/W:
lock failed: Interrupted system call
bayes: cannot open bayes databases
/Users/rod/sa/Mail-SpamAssassin-3.1.0/masses/spamassassin/bayes_* R/W:
lock failed: Interrupted system call

Rod.

-- 
:: Rod Begbie :: http://groovymother.com/ ::

Re: NOTICE: 3.1.0 rescoring mass-checks

Posted by Theo Van Dinter <fe...@apache.org>.
On Thu, Jun 30, 2005 at 11:02:14AM -0500, Michael Parker wrote:
> >bayes_path /home/corpus/SA/Mail-SpamAssassin-3.1.0/masses/spamassassin/bayes
> Wasn't necessary for me.  userstate_dir should be set to
> $FindBin::Bin/spamassassin, so no need to set bayes_path.

Hrm.  I used to have to set bayes_path, but maybe we fixed it at some point
and I never changed. :|

> Are you logged in?

Details... ;)

I must've been distracted yesterday, all these silly little issues.  Geesh.

-- 
Randomly Generated Tagline:
"It's beyond my comprehension why anyone would attach a keyboard to a
 production Sun box in the first place :-)" - Michael Wei

Re: NOTICE: 3.1.0 rescoring mass-checks

Posted by Michael Parker <pa...@pobox.com>.
Theo Van Dinter wrote:

>On Wed, Jun 29, 2005 at 06:48:02PM -0700, Justin Mason wrote:
>
>>http://wiki.apache.org/spamassassin/RescoreDetails has all the
>>details.
>
>You'll want to set something like:
>
>bayes_path /home/corpus/SA/Mail-SpamAssassin-3.1.0/masses/spamassassin/bayes
>
Wasn't necessary for me.  userstate_dir should be set to
$FindBin::Bin/spamassassin, so no need to set bayes_path.

>As well, so you don't overwrite (potentially) your own Bayes DB.  I'd edit the
>wiki page, but it's immutable. :(
>
Are you logged in?

Michael

Re: NOTICE: 3.1.0 rescoring mass-checks

Posted by Theo Van Dinter <fe...@apache.org>.
On Wed, Jun 29, 2005 at 06:48:02PM -0700, Justin Mason wrote:
> http://wiki.apache.org/spamassassin/RescoreDetails has all the
> details.

You'll want to set something like:

bayes_path /home/corpus/SA/Mail-SpamAssassin-3.1.0/masses/spamassassin/bayes

As well, so you don't overwrite (potentially) your own Bayes DB.  I'd edit the
wiki page, but it's immutable. :(

-- 
Randomly Generated Tagline:
And if Iraq's primary export was broccoli?

Re: NOTICE: 3.1.0 rescoring mass-checks

Posted by Rod Begbie <ro...@gmail.com>.
On 6/30/05, Theo Van Dinter <fe...@apache.org> wrote:
> > a) For those of us not intimately familiar with the numeric values of
> > date/time in perl, what --after value would bring us to Jan 1 2005?
> 
> Hrm.  1041397200 was 1/1/03.  +365 days is 1072933200, which was 1/1/04.
> +366 days is 1104555600, which was 1/1/05. :)

--after "-6 months" works for me.

Rod.

-- 
:: Rod Begbie :: http://groovymother.com/ ::

Re: NOTICE: 3.1.0 rescoring mass-checks

Posted by Theo Van Dinter <fe...@apache.org>.
On Thu, Jun 30, 2005 at 08:30:19PM -0700, Robert Menschel wrote:
> a) For those of us not intimately familiar with the numeric values of
> date/time in perl, what --after value would bring us to Jan 1 2005?

Hrm.  1041397200 was 1/1/03.  +365 days is 1072933200, which was 1/1/04.
+366 days is 1104555600, which was 1/1/05. :)
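
For anyone who would rather compute an --after value than work it out by hand, here is a minimal Python sketch.  It assumes, as the numbers above imply, that the cutoffs are midnight at UTC-5 (US Eastern standard time).  The helper name after_value is made up for illustration and is not part of mass-check.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the values quoted above are midnight at UTC-5 (US Eastern
# standard time). Change the offset to match your own corpus.
EASTERN_STANDARD = timezone(timedelta(hours=-5))


def after_value(year, month=1, day=1, tz=EASTERN_STANDARD):
    """Return the epoch seconds for midnight on the given date in tz."""
    return int(datetime(year, month, day, tzinfo=tz).timestamp())


if __name__ == "__main__":
    # Print the cutoff for each of the three years discussed in the thread.
    for year in (2003, 2004, 2005):
        print(year, after_value(year))
```

The result for the chosen date is what would be passed as --after=<value>.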

> b) I am concerned that starting a full rescoring mass-check against a
> large corpus will take longer than allowed.  I'll have to abort, and
> send in what I have, but "what I have" will be the results generated

I'd suggest letting it run for a little bit and then estimating how many
messages you can run through in the time allotted.  I'm doing the
same thing.  It's not 100% accurate, but after 15-30 minutes you should be
able to multiply out and determine the # of messages you can run through (I
leave some wiggle room of 1-2 days), then restart the mass-check with
that many messages.
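
The "multiply out" step can be sketched in a few lines of Python.  This is only an illustration of the arithmetic: the function name and the sample numbers are hypothetical, and it assumes the processing rate seen in the short trial run holds for the rest of the corpus.

```python
def messages_that_fit(processed, elapsed_seconds, seconds_available,
                      margin_seconds=0):
    """Extrapolate how many messages fit in the time left.

    processed / elapsed_seconds is the rate observed during a short trial
    run; margin_seconds is the wiggle room held back from the deadline.
    """
    if processed <= 0 or elapsed_seconds <= 0:
        raise ValueError("need a non-empty trial run to extrapolate from")
    usable = max(seconds_available - margin_seconds, 0)
    # Integer arithmetic avoids floating-point rounding in the estimate.
    return processed * usable // elapsed_seconds


if __name__ == "__main__":
    DAY = 86400
    # Hypothetical trial: 3000 messages in 30 minutes, 7 days until the
    # deadline, 2 days held back as wiggle room.
    print(messages_that_fit(3000, 30 * 60, 7 * DAY, margin_seconds=2 * DAY))
```

The result is an upper bound at best, since later messages may be larger or slower to scan than the ones in the trial.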

> for older emails, not newer emails. Would it be appropriate to have
> mass-check process emails newest to oldest?

No.  It needs to go in order, oldest to newest for Bayes.

-- 
Randomly Generated Tagline:
"I'm not bad, I'm just drawn that way." - Jessica Rabbit

Re[3]: NOTICE: 3.1.0 rescoring mass-checks

Posted by Robert Menschel <Ro...@Menschel.net>.
Thursday, June 30, 2005, 8:30:19 PM, I wrote:

TVD>> "The --after=1041397200 option tells mass-check to ignore messages older than
TVD>> 18 months ago (in this case January 1 2003). This is useful if your corpus has
TVD>> older messages intermingled with your newer messages."

TVD>> 18 months ago would be Jan 1 2004, not 2003.  We also usually limit to
TVD>> 6 months, not 18, but ...

RM> a) For those of us not intimately familiar with the numeric values of
RM> date/time in perl, what --after value would bring us to Jan 1 2005?

Never mind.  I checked an old mass-check log, and found that the last
2004 email generated this line:
> .  0 ./corpus.ham/h041231.ham.2114159 [rule hits here]
>      time=1104563602,mid=...

So I'm using this 1104563602 value as my "starting" time.
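
The same lookup can be scripted.  The sketch below is a rough Python illustration that assumes only what the excerpt above shows, namely that a log entry carries a time=<epoch> field.  The function names are hypothetical and nothing here comes from mass-check itself.

```python
import re

# Assumption: each log entry contains a "time=<epoch seconds>" field, as in
# the excerpt quoted above. Nothing else about the log layout is relied on.
TIME_FIELD = re.compile(r"\btime=(\d+)")


def log_times(lines):
    """Yield the integer time= value from each line that has one."""
    for line in lines:
        match = TIME_FIELD.search(line)
        if match:
            yield int(match.group(1))


def latest_before(lines, cutoff):
    """Return the largest time= value strictly below cutoff, or None."""
    earlier = [t for t in log_times(lines) if t < cutoff]
    return max(earlier) if earlier else None
```

Passing an open log file as lines works, since a file object iterates line by line.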

Bob Menschel




Re[2]: NOTICE: 3.1.0 rescoring mass-checks

Posted by Robert Menschel <Ro...@Menschel.net>.
Hello Theo,

Thursday, June 30, 2005, 8:59:47 AM, you wrote:

TVD> On Wed, Jun 29, 2005 at 06:48:02PM -0700, Justin Mason wrote:
>> http://wiki.apache.org/spamassassin/RescoreDetails has all the
>> details.

TVD> Just to note:

TVD> "The --after=1041397200 option tells mass-check to ignore messages older than
TVD> 18 months ago (in this case January 1 2003). This is useful if your corpus has
TVD> older messages intermingled with your newer messages."

TVD> 18 months ago would be Jan 1 2004, not 2003.  We also usually limit to
TVD> 6 months, not 18, but ...

a) For those of us not intimately familiar with the numeric values of
date/time in perl, what --after value would bring us to Jan 1 2005?

b) I am concerned that starting a full rescoring mass-check against a
large corpus will take longer than allowed.  I'll have to abort, and
send in what I have, but "what I have" will be the results generated
for older emails, not newer emails. Would it be appropriate to have
mass-check process emails newest to oldest?

Bob Menschel




Re[2]: NOTICE: 3.1.0 rescoring mass-checks

Posted by Robert Menschel <Ro...@Menschel.net>.
Hello Nix,

Saturday, July 2, 2005, 2:46:52 AM, you wrote:

N> This is far more elaborate than needed, I think. Limiting the age of
N> your spam corpus (which I do anyway) and using mass-check normally will
N> do the trick, as mass-check runs through mails in temporal order.  The
N> only `error' will be that ham of age [now - a couple of years] will
N> cohabit in the Bayes DB with spam of age [now - six months]. If this
N> caused a problem Bayes would be nearly useless anyway :)

Except, doing it this simple way (which is how I do normal, non-bayes
mass-checks), means that you'd load (autolearn) a year's worth of ham
into your Bayes database before giving it the first spam. Your Bayes
database will be out of balance until it has learned a significant
number of spam or

N> If expiry runs it ditches the ancient email first in any case.

until the first significant expiry gets rid of much of that older ham.

N> I think I'll do a few local perceptron runs with mass-checks with
N> different --limits after the rescoring mass-check is completed, and
N> see just what effect varying the limit on ham actually has. I'm
N> blithering in the absence of data right now.

Good idea. I'm interested to know what you find.

Bob Menschel




Re: NOTICE: 3.1.0 rescoring mass-checks

Posted by Nix <ni...@esperi.org.uk>.
On Fri, 1 Jul 2005, Robert Menschel yowled:
> Since I wasn't mass-checking Bayes, all I did was one mass-check run
> specifying only my ham corpus, and then a second mass-check run
> specifying only my spam corpus.  I then combined them for the
> frequency analysis.
> 
> It should be feasible to modify the rescoring mass-check instructions
> so you do something like:
> a) initialize the mass-check (including removing any prior Bayes
> database)
> b) split your ham corpus (1-2 years) into 10 equal parts. Split your
> spam corpus (2-6 months) into 10 equal parts.
> c) Cycle through your 20 corpus files, running mass-check on each:
> oldest ham, oldest spam, next oldest ham, next oldest spam, etc.
> d) Combine all ham logs into one, combine all spam logs into one.
> 
> It's not optimal, in that Bayes will be trained on emails out of time
> sequence, but it should shuffle them enough to get useful results out
> of it, IMO.

This is far more elaborate than needed, I think. Limiting the age of
your spam corpus (which I do anyway) and using mass-check normally will
do the trick, as mass-check runs through mails in temporal order.  The
only `error' will be that ham of age [now - a couple of years] will
cohabit in the Bayes DB with spam of age [now - six months]. If this
caused a problem Bayes would be nearly useless anyway :)

If expiry runs it ditches the ancient email first in any case.


I think I'll do a few local perceptron runs with mass-checks with
different --limits after the rescoring mass-check is completed, and
see just what effect varying the limit on ham actually has. I'm
blithering in the absence of data right now.

-- 
`I lost interest in "blade servers" when I found they didn't throw knives
 at people who weren't supposed to be in your machine room.'
    --- Anthony de Boer

Re[2]: NOTICE: 3.1.0 rescoring mass-checks

Posted by Robert Menschel <Ro...@Menschel.net>.
Hello Nix,

Friday, July 1, 2005, 5:00:00 PM, you wrote:

N> On Thu, 30 Jun 2005, Theo Van Dinter spake:
>> 18 months ago would be Jan 1 2004, not 2003.  We also usually limit to
>> 6 months, not 18, but ...

N> Six months isn't much for ham at all, is it? That would only give me a
N> thousand or so hams, and more than a hundred times as much spam as ham.

N> This seems a little... unbalanced. Ham doesn't change *that* fast.

N> (Maybe I should suck a few mailing lists into the ham, but I'm chary of
N> that because many of those lists may also be being used by others as
N> ham sources, so it may lead to duplication.)

I'm in a fortunate position that my corpus pulls in 20k ham and 20k
spam each week, so this isn't a concern for me at the moment.

However, previously my pattern was like yours, and when I would
mass-check, I'd mass-check on two years' ham vs 3 months' spam.

Since I wasn't mass-checking Bayes, all I did was one mass-check run
specifying only my ham corpus, and then a second mass-check run
specifying only my spam corpus.  I then combined them for the
frequency analysis.

It should be feasible to modify the rescoring mass-check instructions
so you do something like:
a) initialize the mass-check (including removing any prior Bayes
database)
b) split your ham corpus (1-2 years) into 10 equal parts. Split your
spam corpus (2-6 months) into 10 equal parts.
c) Cycle through your 20 corpus files, running mass-check on each:
oldest ham, oldest spam, next oldest ham, next oldest spam, etc.
d) Combine all ham logs into one, combine all spam logs into one.

It's not optimal, in that Bayes will be trained on emails out of time
sequence, but it should shuffle them enough to get useful results out
of it, IMO.
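
Steps b) and c) above could be sketched roughly as follows in Python.  This only shows the ordering logic, under the assumption that each corpus is already available as a list sorted oldest first.  The function names are hypothetical, and actually running mass-check on each chunk is left out.

```python
def split_into_parts(messages, parts=10):
    """Split an oldest-first list into `parts` contiguous, near-equal chunks."""
    size, extra = divmod(len(messages), parts)
    chunks, start = [], 0
    for i in range(parts):
        # The first `extra` chunks take one more message each.
        end = start + size + (1 if i < extra else 0)
        chunks.append(messages[start:end])
        start = end
    return chunks


def interleaved_runs(ham, spam, parts=10):
    """Return (label, chunk) pairs: oldest ham, oldest spam, next ham, ..."""
    runs = []
    for ham_chunk, spam_chunk in zip(split_into_parts(ham, parts),
                                     split_into_parts(spam, parts)):
        runs.append(("ham", ham_chunk))
        runs.append(("spam", spam_chunk))
    return runs
```

Each pair in the returned list corresponds to one mass-check run in step c); step d) would then concatenate the logs by label.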

Bob Menschel




Re: NOTICE: 3.1.0 rescoring mass-checks

Posted by Nix <ni...@esperi.org.uk>.
On Thu, 30 Jun 2005, Theo Van Dinter spake:
> 18 months ago would be Jan 1 2004, not 2003.  We also usually limit to
> 6 months, not 18, but ...

Six months isn't much for ham at all, is it? That would only give me a
thousand or so hams, and more than a hundred times as much spam as ham.

This seems a little... unbalanced. Ham doesn't change *that* fast.

(Maybe I should suck a few mailing lists into the ham, but I'm chary of
that because many of those lists may also be being used by others as
ham sources, so it may lead to duplication.)

-- 
`I lost interest in "blade servers" when I found they didn't throw knives
 at people who weren't supposed to be in your machine room.'
    --- Anthony de Boer

Re: NOTICE: 3.1.0 rescoring mass-checks

Posted by Theo Van Dinter <fe...@apache.org>.
On Wed, Jun 29, 2005 at 06:48:02PM -0700, Justin Mason wrote:
> http://wiki.apache.org/spamassassin/RescoreDetails has all the
> details.

Just to note:

"The --after=1041397200 option tells mass-check to ignore messages older than
18 months ago (in this case January 1 2003). This is useful if your corpus has
older messages intermingled with your newer messages."

18 months ago would be Jan 1 2004, not 2003.  We also usually limit to
6 months, not 18, but ...

-- 
Randomly Generated Tagline:
"Eighty percent of married men cheat in America.  The rest cheat in Europe."
                      - Jackie Mason

Re: NOTICE: 3.1.0 rescoring mass-checks

Posted by Daniel Quinlan <qu...@pathname.com>.
Daryl C W O'Shea <sp...@dostech.ca> writes:

> The SPF_HELO_* rule semantics have changed too.  We now look for an SPF 
> record for the actual host name and not the registered domain name.
>
> Daryl

Hmmm... the reuse is probably good enough for the purpose of the
mass-checks.  We could compare efficacy now with before, but I doubt the
difference in end scores will be that large -- SPF isn't all that
accurate right now.  Anyway, I don't think it would be worth a pre4 to
remove the #reuse for this time around.

Daniel

-- 
Daniel Quinlan
http://www.pathname.com/~quinlan/

Re: NOTICE: 3.1.0 rescoring mass-checks

Posted by "Daryl C. W. O'Shea" <sp...@dostech.ca>.
Daniel Quinlan wrote:
> Theo Van Dinter <fe...@apache.org> writes:
> 
> 
>>Another related question is: why aren't all the net rules from 3.0 listed?
>>ie: URIBL_*, HABEAS*, RCVD_IN_RSL, SPF*, aren't listed.
> 
> 
> Urgh... some of those, the ones where the semantics have not changed
> much (including most or all of the SPF ones, even), should be reused.

The SPF_HELO_* rule semantics have changed too.  We now look for an SPF 
record for the actual host name and not the registered domain name.

Daryl


Re: Re[2]: NOTICE: 3.1.0 rescoring mass-checks

Posted by Theo Van Dinter <fe...@apache.org>.
On Fri, Jul 01, 2005 at 05:52:15PM -0700, Dan Quinlan wrote:
> SPF* and URIBL* are problems!

Hrm.  So reuse doesn't seem to work the way I thought/expected it to.
Right now, I have my mass-check going, and have very few hits for the
URIBL rules.  It looks like what's happening is that with --reuse,
mass-check assumes all mails were previously run through SpamAssassin,
and so would have an X-Spam-Status line to go parse.

Of course, that assumption is largely false -- there's no guarantee of
a former status line at all.  In my case, a vast majority of my corpus
(ham and spam traps) have no X-Spam-Status header because they would be
directly filed from procmail into an appropriate folder without being
scanned first (there was no reuse at the time).  So at the moment,
of the 172k spam messages processed so far, only 8k have a URIBL hit,
all personal spams that I received which were filtered when received.

What I expected to happen is that if there was an X-Spam-Status header in
the message, mass-check would reuse the net results.  If there wasn't, it'd
let the normal rules run as before.
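
To make the expected behaviour concrete, here is a small Python sketch of the decision described above.  It is not mass-check's code (mass-check is Perl, and this says nothing about how --reuse is really implemented); the function name and return values are made up for illustration.

```python
from email import message_from_string


def plan_net_checks(raw_message):
    """Decide how net rules would be handled for one message.

    Returns "reuse" when an X-Spam-Status header is present, so earlier
    results could be parsed from it, and "live" when it is absent, so the
    normal network rules would have to run.
    """
    msg = message_from_string(raw_message)
    return "reuse" if msg["X-Spam-Status"] is not None else "live"
```

With a corpus filed straight from procmail, as described above, most messages would fall into the "live" branch.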

-- 
Randomly Generated Tagline:
 "Hey, I'm startin' to get the hang of this game. The blerns are loaded. The
 count's three blerns and two anti-blerns, and the infield blern rule is in
 effect. Right?" -Fry
 "Other than the word blern, that was complete gibberish." -Leela

Re: Re[2]: NOTICE: 3.1.0 rescoring mass-checks

Posted by Daniel Quinlan <qu...@pathname.com>.
Theo Van Dinter <fe...@apache.org> writes:

> Another related question is: why aren't all the net rules from 3.0 listed?
> ie: URIBL_*, HABEAS*, RCVD_IN_RSL, SPF*, aren't listed.

HABEAS rules have changed, although it might be possible to map from the
old rules to the new ones.  It's *probably* fine to use real-time for
these since they are mostly hand-scored anyway (and they have a low hit
rate).

RCVD_IN_RSL isn't in both 3.0 and 3.1 (gone in 3.1).

SPF* and URIBL* are problems!

Please review http://bugzilla.spamassassin.org/show_bug.cgi?id=4450 -- I
think we should strongly consider a pre3 and encourage people to
mass-check with it (and extend mass-checks).  mass-checks will be faster
with URIBL reuse anyway.

Also, we need to make a few small modifications to the directions:

 - double check the --after option (should be 12 months unless person
   needs to use 18 months of ham)

 - (not a huge deal) people should remove specific '#reuse' lines if
   (and only if) they did not run with that plugin enabled in 3.0 normal
   usage, but plan on using it in 3.1

Daniel

-- 
Daniel Quinlan
http://www.pathname.com/~quinlan/

Re: Re[2]: NOTICE: 3.1.0 rescoring mass-checks

Posted by Daniel Quinlan <qu...@pathname.com>.
Theo Van Dinter <fe...@apache.org> writes:

> Another related question is: why aren't all the net rules from 3.0 listed?
> ie: URIBL_*, HABEAS*, RCVD_IN_RSL, SPF*, aren't listed.

Urgh... some of those, the ones where the semantics have not changed
much (including most or all of the SPF ones, even), should be reused.

They are missing '#reuse' lines.  :-/

Daniel

-- 
Daniel Quinlan
http://www.pathname.com/~quinlan/

Re: Re[2]: NOTICE: 3.1.0 rescoring mass-checks

Posted by Theo Van Dinter <fe...@apache.org>.
On Thu, Jun 30, 2005 at 11:48:48PM -0700, Justin Mason wrote:
> yep -- this seems to be something to do with --reuse.  Hopefully
> Daniel will elucidate ;)

Another related question is: why aren't all the net rules from 3.0 listed?
ie: URIBL_*, HABEAS*, RCVD_IN_RSL, SPF*, aren't listed.

-- 
Randomly Generated Tagline:
"Getting impressive titles isn't hard if you work for people without a clue."
         - Theo about misleading "Senior Administrator" titles

Re[2]: NOTICE: 3.1.0 rescoring mass-checks

Posted by Robert Menschel <Ro...@Menschel.net>.

Thursday, June 30, 2005, 11:44:05 AM, you wrote:

TVD> On Wed, Jun 29, 2005 at 06:48:02PM -0700, Justin Mason wrote:
>> http://wiki.apache.org/spamassassin/RescoreDetails has all the
>> details.

A few thousand emails into my mass-check, I found
masses/spamassassin/mass_prefs, which includes:
bayes_auto_learn 0
lock_method flock
bayes_store_module Mail::SpamAssassin::BayesStore::SDBM
use_auto_whitelist 0
score DCC_CHECK 0
score DIGEST_MULTIPLE 0
score DNS_FROM_AHBL_RHSBL 0
score DNS_FROM_RFC_ABUSE 0
score DNS_FROM_RFC_BOGUSMX 0
score DNS_FROM_RFC_DSN 0
score DNS_FROM_RFC_POST 0
score DNS_FROM_RFC_WHOIS 0
score DNS_FROM_SECURITYSAGE 0
score NO_DNS_FOR_FROM 0
score PYZOR_CHECK 0
score RAZOR2_CF_RANGE_51_100 0
score RAZOR2_CHECK 0
score RCVD_IN_BL_SPAMCOP_NET 0
score RCVD_IN_BSP_OTHER 0
score RCVD_IN_BSP_TRUSTED 0
score RCVD_IN_DSBL 0
score RCVD_IN_NJABL_CGI 0
score RCVD_IN_NJABL_DUL 0
score RCVD_IN_NJABL_MULTI 0
score RCVD_IN_NJABL_PROXY 0
score RCVD_IN_NJABL_RELAY 0
score RCVD_IN_NJABL_SPAM 0
score RCVD_IN_SBL 0
score RCVD_IN_SORBS_BLOCK 0
score RCVD_IN_SORBS_DUL 0
score RCVD_IN_SORBS_HTTP 0
score RCVD_IN_SORBS_MISC 0
score RCVD_IN_SORBS_SMTP 0
score RCVD_IN_SORBS_SOCKS 0
score RCVD_IN_SORBS_SPAM 0
score RCVD_IN_SORBS_WEB 0
score RCVD_IN_SORBS_ZOMBIE 0
score RCVD_IN_WHOIS_INVALID 0
score RCVD_IN_XBL 0
score ROUND_THE_WORLD 0

Since there was nothing but the user_prefs file in this directory when
the mass-check started, I'm assuming it was created by and is being
used by mass-check.

Is it intentional that these network checks all get turned off even
though we're doing a --net mass-check?

Bob Menschel




Re: NOTICE: 3.1.0 rescoring mass-checks

Posted by Theo Van Dinter <fe...@apache.org>.
On Thu, Jun 30, 2005 at 02:44:05PM -0400, Theo Van Dinter wrote:
> Hrm.  So I was noticing that my results are including Razor2 and DCC rules,
> which leads me to believe that mass-check is reading the system *.pre files,
> not the rules/*.pre files.

In thinking about this some more, since it's pretty impossible based
on the code for mass-check to use the system *.pre files, I now believe
these results are most likely from --reuse.

FWIW.

-- 
Randomly Generated Tagline:
You tell 'em Bean, He's stringing you.

Re: NOTICE: 3.1.0 rescoring mass-checks

Posted by Theo Van Dinter <fe...@kluge.net>.
On Wed, Jun 29, 2005 at 06:48:02PM -0700, Justin Mason wrote:
> http://wiki.apache.org/spamassassin/RescoreDetails has all the
> details.

Hrm.  So I was noticing that my results are including Razor2 and DCC rules,
which leads me to believe that mass-check is reading the system *.pre files,
not the rules/*.pre files.

Is anyone else seeing this?

-- 
Randomly Generated Tagline:
To alcohol!  The cause of -- and solution to -- all of life's problems!
 
 		-- Homer Simpson
 		   Homer vs. the Eighteenth Amendment
