Posted to users@spamassassin.apache.org by Amir 'CG' Caspi <ce...@3phase.com> on 2013/07/29 07:12:09 UTC

Forgetting mis-learned email

Hi all,

	So, some of my FNs get autolearned as ham, and because of the 
way my mail queue is set up, I typically only see this once the mail 
reaches my MUA and has already been deleted from the online inbox.  I 
have one particular message that got autolearned as ham (but should 
be spam), and I'm trying to run it through sa-learn --forget... but 
it's not forgetting anything.  (It tells me "Forgot tokens from 0 
message(s) (1 message(s) examined)".)
	It would appear that something got changed between when SA 
autolearned this message as ham, and when my MUA processed it.

Is there any way I can get sa-learn to forget this message, by 
forcing the message-ID or something?  Or, am I basically stuck and my 
Bayes DB has now been poisoned by this mis-learned email that I can't 
forget?

I'll note that this doesn't always happen... sometimes sa-learn can 
forget mail that I paste in from my MUA.  This time, it's not working.

Any help is appreciated.

(At some point, I will need to change my mail handling so that mail 
won't get deleted from the server, either for some time period or 
indefinitely, so that I can properly redirect FN mail from my 
server-side inbox rather than the MUA... presumably this will resolve 
this issue, I hope.)

Thanks.
						--- Amir

Re: Forgetting mis-learned email

Posted by RW <rw...@googlemail.com>.
On Mon, 29 Jul 2013 18:30:40 +0200
Karsten Bräckelmann wrote:

> On Mon, 2013-07-29 at 18:21 +0200, Karsten Bräckelmann wrote:
> > Tried --forget without the To header?
> 
> Seeing that empty Subject header, instead of completely removing the
> To header it might also be an empty header ("To: " with a trailing
> space). Probably before or after the From header, or maybe before
> Subject...

The sa-learn message ID is derived from the Date header, the first
(bottom) Received header, and the top half of the body, up to 2k.
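
To see why small changes to those parts matter, here is a minimal Python
sketch of a fingerprint built from the three ingredients listed above. It
is an illustration only, not SpamAssassin's actual code: the function name
sketch_msgid, the use of SHA-1, the NUL separators, and the exact reading
of "top half, up to 2k" are all assumptions.

```python
import hashlib
from email import message_from_bytes


def sketch_msgid(raw: bytes) -> str:
    """Hypothetical fingerprint: Date + bottom Received + start of the body."""
    msg = message_from_bytes(raw)
    date = str(msg.get("Date", "None"))
    received = msg.get_all("Received") or [""]
    bottom_received = str(received[-1])  # the first hop is the last Received header
    # Everything after the first blank line is treated as the body.
    _, _, body = raw.partition(b"\n\n")
    # Assumed reading of "top half, up to 2k": half the body, capped at 2048 bytes.
    head = body[: min(len(body) // 2, 2048)]
    digest = hashlib.sha1()
    for text in (date, bottom_received):
        digest.update(text.encode("utf-8", "surrogateescape") + b"\0")
    digest.update(head)
    return digest.hexdigest()


headers = (b"Received: from a.example by b.example\n"
           b"Date: Mon, 29 Jul 2013 07:12:09 +0000\n"
           b"Subject: \n")
body = b"Buy now\nlimited offer\nclick here today\n"

original = headers + b"\n" + body
crlf_body = headers + b"\n" + body.replace(b"\n", b"\r\n")
extra_to = headers + b"To: undisclosed-recipients:;\n" + b"\n" + body

print(sketch_msgid(original) == sketch_msgid(crlf_body))  # False
print(sketch_msgid(original) == sketch_msgid(extra_to))   # True
```

Under this sketch, rewritten line endings in the body change the
fingerprint, while an added To header does not. Whether the real ID
behaves the same way for a given message would need to be checked against
SpamAssassin itself.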

Perhaps the problem is due to Windows newlines.

Re: Forgetting mis-learned email

Posted by Karsten Bräckelmann <gu...@rudersport.de>.
On Mon, 2013-07-29 at 18:21 +0200, Karsten Bräckelmann wrote:
> Tried --forget without the To header?

Seeing that empty Subject header, instead of completely removing the To
header it might also be an empty header ("To: " with a trailing space).
Probably before or after the From header, or maybe before Subject...


-- 
char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}


Re: Forgetting mis-learned email

Posted by Karsten Bräckelmann <gu...@rudersport.de>.
On Mon, 2013-07-29 at 10:06 -0600, Amir 'CG' Caspi wrote:
> At 5:48 PM +0200 07/29/2013, Karsten Bräckelmann wrote:

> > See the Content-Type and Content-Transfer-Encoding headers.
> 
> There were none for this email.

Content-Type: text/plain
Content-Transfer-Encoding: 8bit


> The complete "raw source" (identical to what's actually in the mbox) is
> available here: http://pastebin.com/GsRsYzMD  (I've blanked out
> personally-identifying information, so the message ID has been
> modified for this paste, but obviously not in the original email).
> 
> I'm suspecting that it's the "undisclosed recipients" in the To: field
> that may be causing confusion with Bayes... I think the MUA inserted 
> that, but not entirely positive.

Possible indeed, and even likely, given its position after any other
original header. The Status header is added by your MUA (and ignored by
Bayes).

Tried --forget without the To header?



Re: Forgetting mis-learned email

Posted by Amir 'CG' Caspi <ce...@3phase.com>.
On Mon, July 29, 2013 10:21 am, Karsten Bräckelmann wrote:
>> There were none for this email.
>
> Content-Type: text/plain
> Content-Transfer-Encoding: 8bit

Whoops.  I missed those...  I guess this could be why a 7-bit copy/paste
wouldn't work, and using the mbox file directly is required.

> Tried --forget without the To header?

Not yet, nor have I tried with an empty To header, or skipping the Subject
header.  I will give those a shot.  I'll note that I didn't see any
clearly 8-bit characters when I looked at the file (my text editor should
have shown those), but that may really be the issue... or AN issue, on top
of the To header.
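
One way to settle the 8-bit question without trusting a text editor is to
scan the raw bytes. The following is a minimal Python sketch; the function
name find_8bit_bytes and the sample message are made up for illustration.

```python
def find_8bit_bytes(raw: bytes):
    """Return (line_number, column, byte_value) for every byte above 0x7F."""
    hits = []
    for lineno, line in enumerate(raw.split(b"\n"), start=1):
        for col, value in enumerate(line, start=1):
            if value > 0x7F:
                hits.append((lineno, col, value))
    return hits


# Made-up sample: a Latin-1 e-acute (0xE9) hiding in the body.
sample = b"Subject: hello\n\nCaf\xe9 prices\n"
print(find_8bit_bytes(sample))  # [(3, 4, 233)]
```

For a saved mail, the bytes would come from something like
open("saved-message.txt", "rb").read(), where the file name is only a
placeholder. An empty result means the file is pure 7-bit.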


On Mon, July 29, 2013 11:08 am, Benny Pedersen wrote:
> Well, here the Bayes score is lower than the autolearn threshold, so it
> learns itself as ham forever.

Well, the "forever" part is what I'm trying to overcome, by using --forget.

> To me it seems like you only use Bayes and nothing else in SpamAssassin.
> Have you disabled other plugins?

No, I haven't disabled any other plugins.  No other tests hit for this
email when it was run through spamc/spamd.  Even running it through SA
manually now, the only other positive test is RCVD_IN_PSBL, and that's
probably because it has been reported since I received the message.  Other
emails get plenty of hits from other plugins... this one is simply not
hitting them.

> Why was it learned as ham in the first place?

I think Karsten answered that one well - it's because of the autolearn
threshold.

> Are you using a different Bayes user?

No, I'm using the same Bayes user now as when the mail was first scanned. 
I'm not THAT much of a newbie. ;-)

On Mon, July 29, 2013 2:13 pm, RW wrote:
> Perhaps the problem is due to Windows newlines.

From my MUA, you mean?  I should note that I don't use Windows, I use a
Mac running OS X.  My MUA uses CR (not CRLF) line breaks, but the mail
server itself is Linux-based so the original email used pure LF line
breaks.  I made sure that the email I ran through sa-learn --forget used
LF line breaks, as well, which has worked in the past but not on this
email.
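
For anyone who wants to check line breaks rather than assume them, here is
a small Python sketch that counts each style in a raw message and converts
everything to LF. The function names are hypothetical, and normalizing
this way is only a guess at what the original looked like on the server.

```python
def count_line_endings(raw: bytes) -> dict:
    """Count CRLF, bare CR and bare LF line breaks in raw message bytes."""
    crlf = raw.count(b"\r\n")
    cr = raw.count(b"\r") - crlf   # bare CR only
    lf = raw.count(b"\n") - crlf   # bare LF only
    return {"CRLF": crlf, "CR": cr, "LF": lf}


def to_lf(raw: bytes) -> bytes:
    """Convert CRLF and bare CR line breaks to LF."""
    return raw.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


mixed = b"line one\r\nline two\rline three\n"
print(count_line_endings(mixed))         # {'CRLF': 1, 'CR': 1, 'LF': 1}
print(count_line_endings(to_lf(mixed)))  # {'CRLF': 0, 'CR': 0, 'LF': 3}
```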

I don't think linefeeds are the problem, though... based on the above, I'm
strongly suspecting either an 8-bit to 7-bit translation error through my
unwise copy/paste routine, and/or the "undisclosed recipients" To
header-munging being the primary issue.

I'll try different combos and get back to you guys on what worked (if
anything).

Cheers.

						--- Amir


Re: Forgetting mis-learned email

Posted by Karsten Bräckelmann <gu...@rudersport.de>.
On Mon, 2013-07-29 at 19:08 +0200, Benny Pedersen wrote:
> X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00 autolearn=ham
> 	version=3.3.2
> 
> Well, here the Bayes score is lower than the autolearn threshold, so it
> learns itself as ham forever.

Rules with tflags learn (the Bayesian rules) are ignored when determining
whether a message should be trained on. [1]

> why is it learnt as ham in the first place ?

Because bayes_auto_learn_threshold_nonspam is 0.1 by default, [1] and
the overall score sans tflags learn and friends is 0.0.
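
The decision described above can be sketched roughly as follows. This is
a simplified Python illustration, not the plugin's code: the function
name, the tuple layout and the RCVD_IN_PSBL score are made up, the 12.0
spam threshold is the documented default rather than something stated in
this thread, and any further conditions the plugin applies are left out.

```python
NONSPAM_THRESHOLD = 0.1  # bayes_auto_learn_threshold_nonspam default, as stated above
SPAM_THRESHOLD = 12.0    # bayes_auto_learn_threshold_spam default (assumed from the docs)


def autolearn_decision(rule_hits):
    """rule_hits: list of (rule_name, score, is_learn_rule) tuples."""
    # Rules flagged "tflags learn" (such as BAYES_00) do not count here.
    score = sum(points for _, points, is_learn in rule_hits if not is_learn)
    if score < NONSPAM_THRESHOLD:
        return "ham"
    if score >= SPAM_THRESHOLD:
        return "spam"
    return "no"


# The message from this thread: BAYES_00 was the only hit, so the score
# used for auto-learning is 0.0, which is below 0.1.
print(autolearn_decision([("BAYES_00", -1.9, True)]))  # ham

# Hypothetical: had RCVD_IN_PSBL already fired (score made up), no learning.
print(autolearn_decision([("BAYES_00", -1.9, True),
                          ("RCVD_IN_PSBL", 2.7, False)]))  # no
```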


[1] http://spamassassin.apache.org/full/3.3.x/doc/Mail_SpamAssassin_Plugin_AutoLearnThreshold.html


Re: Forgetting mis-learned email

Posted by Benny Pedersen <me...@junc.eu>.
Amir 'CG' Caspi wrote on 2013-07-29 18:06:

> I'm suspecting that it's the "undisclosed recipients" in the To:
> field that may be causing confusion with Bayes... I think the MUA
> inserted that, but not entirely positive.

X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00 autolearn=ham
	version=3.3.2

Well, here the Bayes score is lower than the autolearn threshold, so it
learns itself as ham forever.

To me it seems like you only use Bayes and nothing else in SpamAssassin.
Have you disabled other plugins?

Why was it learned as ham in the first place?

Are you using a different Bayes user?

Re: Forgetting mis-learned email

Posted by Amir 'CG' Caspi <ce...@3phase.com>.
At 5:48 PM +0200 07/29/2013, Karsten Bräckelmann wrote:
>I strongly suggest to NEVER copy-n-paste like that, but to either run
>sa-learn on an entire mbox, or *save* a single mail to a file. Since

For what it's worth, I also opened the mbox in a 
text editor and copied the actual raw message (as 
in, not from the MUA itself, but directly from 
the mbox), since I suspected exactly what you 
said.  Still no dice, unfortunately.

>See the Content-Type and Content-Transfer-Encoding headers.

There were none for this email.

The complete "raw source" (identical to what's 
actually in the mbox) is available here: 
http://pastebin.com/GsRsYzMD  (I've blanked out 
personally-identifying information, so the 
message ID has been modified for this paste, but 
obviously not in the original email).

I'm suspecting that it's the "undisclosed 
recipients" in the To: field that may be causing 
confusion with Bayes... I think the MUA inserted 
that, but not entirely positive.

Thanks.

						--- Amir

Re: Forgetting mis-learned email

Posted by Karsten Bräckelmann <gu...@rudersport.de>.
On Mon, 2013-07-29 at 09:31 -0600, Amir 'CG' Caspi wrote:
> >SA always needs a full, raw message, including all headers, alternative
> >parts, and attachments if any. And in particular regarding the sa-learn
> >message ID, almost every bit counts.
> 
> Yes, in this case the message was plain-text, no 
> attachment, no MIME, etc.  I used the "view raw 
> source" option in the MUA and pasted that text 
> into a separate file, then attempted to --forget 
> it.  Obviously, it didn't work.  Normally I take 
> the entire Junk mailbox and use --mbox on it 
> (although even that doesn't always work), but 
> this time I wanted to process this individual 
> message.

Even "View Source" isn't necessarily safe. Some MUAs re-flow headers, or
add an MUA-specific header. Plus, whatever is being used to display the
"raw" source, might eat or interpret special chars.

I strongly suggest to NEVER copy-n-paste like that, but to either run
sa-learn on an entire mbox, or *save* a single mail to a file. Since
both happen after the MUA has received the mail, bayes_ignore_header can
be used to ignore specific headers the MUA added.
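
As one possible way to *save* a single mail without going through the
clipboard, here is a minimal Python sketch using the standard mailbox
module. The function name save_message and any paths passed to it are
hypothetical. Note that it does not undo ">From " escaping in the mbox,
so the result may still differ from what SA originally scanned.

```python
import mailbox


def save_message(mbox_path, message_id, out_path):
    """Copy the mail with the given Message-ID from an mbox into its own file.

    Returns True if a matching mail was found and written, else False.
    """
    box = mailbox.mbox(mbox_path, create=False)
    try:
        for key in box.iterkeys():
            found_id = str(box.get_message(key).get("Message-ID", "")).strip()
            if found_id == message_id:
                with open(out_path, "wb") as out:
                    # Raw bytes of the stored mail, without the mbox "From " line.
                    out.write(box.get_bytes(key))
                return True
        return False
    finally:
        box.close()
```

A call would look like save_message("Junk.mbox", "<id@example.com>",
"single.eml"), with all three arguments being placeholders; the saved
file can then be passed to sa-learn.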


> I suppose it's possible there is some 8-bit UTF 
> character in there that's not displaying but that 
> isn't copying and pasting... in which case using 
> the full mbox might be what I need.  But, I don't 
> think that this is the case... I think the 
> message literally is 7-bit plain-text and should 
> therefore be easy to copy/paste using raw source, 
> except for whatever my MUA did to it...

See the Content-Type and Content-Transfer-Encoding headers.



Re: Forgetting mis-learned email

Posted by Amir 'CG' Caspi <ce...@3phase.com>.
At 12:00 PM +0200 07/29/2013, Karsten Bräckelmann wrote:
>Your best bet is to just train what you have as spam, to counter the

Sure, I was planning to do that.  The reason I 
wanted to --forget it was to make sure that I 
wasn't learning it twice (once as ham, once as 
spam).

>You do not have to --forget a mis-trained message anyway, unless simply
>reverting the training is what you want. If you want to correct the
>auto-learn and train as --spam, SA will automatically imply the forget
>step, if the message has been seen (trained) before. (Of course, that

Right, but see above -- I knew that SA kept a 
list of messages it had learned but that the MUA 
can cause changes in the message (e.g. in some 
MIME header) that would cause the Bayes hash to 
be different, and thus the message would be 
considered "new."  In that case, it would be 
learned twice, instead of forgotten and 
re-learned.

>SA always needs a full, raw message, including all headers, alternative
>parts, and attachments if any. And in particular regarding the sa-learn
>message ID, almost every bit counts.

Yes, in this case the message was plain-text, no 
attachment, no MIME, etc.  I used the "view raw 
source" option in the MUA and pasted that text 
into a separate file, then attempted to --forget 
it.  Obviously, it didn't work.  Normally I take 
the entire Junk mailbox and use --mbox on it 
(although even that doesn't always work), but 
this time I wanted to process this individual 
message.

I suppose it's possible there is some 8-bit UTF 
character in there that's not displaying but that 
isn't copying and pasting... in which case using 
the full mbox might be what I need.  But, I don't 
think that this is the case... I think the 
message literally is 7-bit plain-text and should 
therefore be easy to copy/paste using raw source, 
except for whatever my MUA did to it...

Thanks.

						--- Amir

Re: Forgetting mis-learned email

Posted by Karsten Bräckelmann <gu...@rudersport.de>.
On Sun, 2013-07-28 at 23:12 -0600, Amir 'CG' Caspi wrote:
> 	So, some of my FNs get autolearned as ham, and because of the 
> way my mail queue is set up, I typically only see this once the mail 
> reaches my MUA and has already been deleted from the online inbox.  I 
> have one particular message that got autolearned as ham (but should 
> be spam), and I'm trying to run it through sa-learn --forget... but 
> it's not forgetting anything.  (It tells me "Forgot tokens from 0 
> message(s) (1 message(s) examined)".)
> 	It would appear that something got changed between when SA 
> autolearned this message as ham, and when my MUA processed it.
> 
> Is there any way I can get sa-learn to forget this message, by 
> forcing the message-ID or something?  Or, am I basically stuck and my 
> Bayes DB has now been poisoned by this mis-learned email that I can't 
> forget?

Poison is a little extreme. I'm sure your Bayes DB will be fine and
recover quickly.

You cannot force the ID -- well, not the correct one at least, because
that would require having the original message. You could, however,
modify the source code to try to forget each Bayes token
unconditionally, I guess. Probably too much work, though.

Your best bet is to just train what you have as spam, to counter the
ham rating. Bayes works on tokens (think of them as words), not on a
complete message, so this should work if the modification didn't severely
harm the message.

You do not have to --forget a mis-trained message anyway, unless simply
reverting the training is what you want. If you want to correct the
auto-learn and train as --spam, SA will automatically imply the forget
step, if the message has been seen (trained) before. (Of course, that
would not have surfaced the issue of something rewriting your mail in
between.)
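
To make the "implied forget" and the double-learning risk concrete, here
is a toy Python model. It is not how SA stores its Bayes data: the
ToyBayes class, its methods and the message IDs are invented for
illustration, and unlike the real thing it keeps each message's tokens so
that it can forget without needing the message again.

```python
from collections import Counter


class ToyBayes:
    """Toy token database with a 'seen' list keyed by message ID."""

    def __init__(self):
        self.ham = Counter()
        self.spam = Counter()
        self.seen = {}  # msg_id -> (label, tokens)

    def forget(self, msg_id):
        if msg_id not in self.seen:
            return False  # unknown ID: nothing is forgotten
        label, tokens = self.seen.pop(msg_id)
        (self.spam if label == "spam" else self.ham).subtract(tokens)
        return True

    def learn(self, msg_id, tokens, as_spam):
        label = "spam" if as_spam else "ham"
        if msg_id in self.seen:
            if self.seen[msg_id][0] == label:
                return False     # already learned with this label
            self.forget(msg_id)  # implied forget before relearning
        (self.spam if as_spam else self.ham).update(tokens)
        self.seen[msg_id] = (label, list(tokens))
        return True


tokens = ["cheap", "pills"]

# Same ID: training as spam first removes the earlier ham training.
same_id = ToyBayes()
same_id.learn("id-original", tokens, as_spam=False)
same_id.learn("id-original", tokens, as_spam=True)
print(same_id.ham["cheap"], same_id.spam["cheap"])  # 0 1

# Changed ID: the mail looks new, so it ends up counted as ham and as spam.
changed_id = ToyBayes()
changed_id.learn("id-original", tokens, as_spam=False)
print(changed_id.forget("id-modified"))                   # False
changed_id.learn("id-modified", tokens, as_spam=True)
print(changed_id.ham["cheap"], changed_id.spam["cheap"])  # 1 1
```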


> I'll note that this doesn't always happen... sometimes sa-learn can 
> forget mail that I paste in from my MUA.  This time, it's not working.
                     ^^^^^
What do you mean, "paste"!?

SA always needs a full, raw message, including all headers, alternative
parts, and attachments if any. And in particular regarding the sa-learn
message ID, almost every bit counts.

