Posted to dev@spamassassin.apache.org by bu...@spamassassin.apache.org on 2021/11/12 03:03:58 UTC

[Bug 7943] New: TxRep gives nonsensical scores?

https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7943

            Bug ID: 7943
           Summary: TxRep gives nonsensical scores?
           Product: Spamassassin
           Version: 3.4.6
          Hardware: PC
                OS: Linux
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Learner
          Assignee: dev@spamassassin.apache.org
          Reporter: mnalis-sabug@voyager.hr
  Target Milestone: Undefined

TxRep seems to return nonsensical scores. I'm using a MySQL table, if it
matters (the DB files became unusable for me long ago due to heavy locking and
timeouts).

I've finally taken some time to debug it. The first issue was that 3.4.6 was
generating many identical MSGID tokens
("da39a3ee5e6b4b0d3255bfef95601890afd80709@sa_generated" had count>10 within a
few minutes), which would then get reused by both ham and spam because "that
mail was already seen".

(I've partially tracked that problem down to how the SHA-1 hash for
"xxxxxx@sa_generated" is created in 3.4.6: TxRep was using
"Mail::SpamAssassin::Plugin::Bayes->get_msgid()", which seems to be
case-sensitive and only works for one capitalization of "Message-Id"; otherwise
it tries to fall back to a hash of the date/body, but...)

Anyway I've seen SVN trunk has changed that part of the code, so I've simply
disabled MSGID tokens with "txrep_track_messages 0" and truncated the txrep
table, hoping that would solve the issue. It did not - it still returned
strange results (spammy score for hams etc.)

I then tried the SVN trunk version of TxRep.pm, with no luck (I had to copy the
new generate_msgid() over to make it run, and it still produced wrong results).

I then emptied the txrep table, added some debug output, and started feeding
one clearly ham e-mail through "spamassassin -L -t" several times. This is how
the MySQL table looked after each of the first 5 runs (I'm only focusing on the
EMAILIP tag here, but the other tags show the same problem):

        +----------+---------------+------+----------+----------+----------+---------------------+
        | username | email         | ip   | msgcount | totscore | signedby | last_hit            |
        +----------+---------------+------+----------+----------+----------+---------------------+
1st     | amavis   | hepi@hep.hr   | none |        1 |   -10.21 | spf      | 2021-11-12 03:07:03 |
2nd     | amavis   | hepi@hep.hr   | none |        2 |   -10.21 | spf      | 2021-11-12 03:09:27 |
3rd     | amavis   | hepi@hep.hr   | none |        3 |   -10.21 | spf      | 2021-11-12 03:10:24 |
4th     | amavis   | hepi@hep.hr   | none |        4 |   -10.21 | spf      | 2021-11-12 03:11:17 |
5th     | amavis   | hepi@hep.hr   | none |        5 |   -10.21 | spf      | 2021-11-12 03:12:54 |
        +----------+---------------+------+----------+----------+----------+---------------------+

I've added the following debug line just after:
 $delta = ($self->total() + $msgscore) / (1 + $self->count()) - $msgscore;

dbg("TxRep:   mn %s _formula delta = (total()=%0.3f + msgscore=%0.3f) / (1 + count()=%0.3f) - msgscore=%0.3f = %0.3f",
    $tag_id, $self->total(), $msgscore, $self->count(), $msgscore, $delta);


And this is what it printed for those first 5 runs:
dbg: TxRep: mn EMAILIP _formula delta = (total()=0.000 + msgscore=-10.210) / (1 + count()=0.000) - msgscore=-10.210 = 0.000
dbg: TxRep: mn EMAILIP _formula delta = (total()=-10.210 + msgscore=-10.210) / (1 + count()=1.000) - msgscore=-10.210 = 0.000
dbg: TxRep: mn EMAILIP _formula delta = (total()=-10.210 + msgscore=-10.210) / (1 + count()=2.000) - msgscore=-10.210 = 3.403
dbg: TxRep: mn EMAILIP _formula delta = (total()=-10.210 + msgscore=-10.210) / (1 + count()=3.000) - msgscore=-10.210 = 5.105
dbg: TxRep: mn EMAILIP _formula delta = (total()=-10.210 + msgscore=-10.210) / (1 + count()=4.000) - msgscore=-10.210 = 6.126

This looks wrong. I've started with TXREP=0 SA score, and after receiving 5 HAM
messages from that sender, TXREP now returns high positive SPAM score:
 3.1 TXREP                  TXREP: Score normalizing based on sender's reputation

The more HAM I feed it, the higher the SPAM score gets.

I'm thinking $delta is supposed to get slightly more negative with each HAM
that passes through, or at least remain the same, and definitely not start
classifying the email as SPAM. Is my assumption correct? Any idea how $delta
calculation should actually work here?
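
For reference, the arithmetic in the debug output above can be reproduced
outside SpamAssassin. What follows is a minimal Python sketch, not TxRep code:
txrep_delta is a hypothetical helper that mirrors the quoted Perl expression,
and the loop assumes (as the table suggests) that the stored total stays at the
first message's score while the count keeps growing.

```python
def txrep_delta(total, count, msgscore):
    """Mirror of the quoted Perl expression (hypothetical helper, not a TxRep API)."""
    return (total + msgscore) / (1 + count) - msgscore


def simulate_stale_total(msgscore, runs):
    """Assumption: totscore is stored once, msgcount is incremented on every run."""
    deltas = []
    total, count = 0.0, 0
    for _ in range(runs):
        deltas.append(round(txrep_delta(total, count, msgscore), 3))
        if count == 0:
            total = msgscore  # only the first run writes a score
        count += 1
    return deltas


if __name__ == "__main__":
    # One delta per run for the same ham message scored at -10.21.
    print(simulate_stale_total(-10.21, 5))
```

Under that assumption the deltas match the five dbg lines: zero for the first
two runs, then increasingly positive as the count grows against a fixed total.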

-- 
You are receiving this mail because:
You are the assignee for the bug.

[Bug 7943] TxRep gives nonsensical scores?

Posted by bu...@spamassassin.apache.org.

https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7943

Paul Stead <pa...@gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |paul.stead@gmail.com

--- Comment #2 from Paul Stead <pa...@gmail.com> ---
Hi - thanks for taking the time to get all this information together.

I think this could be partly linked to
https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7965

Regarding the +2: this is as "intended", as coded. The first score is for the
standard reputation, and the second score for the key is the "learned" score,
whether autolearned or manually learned.

https://svn.apache.org/viewvc/spamassassin/trunk/lib/Mail/SpamAssassin/Plugin/TxRep.pm?revision=1896315&view=markup#l1283
and
https://svn.apache.org/viewvc/spamassassin/trunk/lib/Mail/SpamAssassin/Plugin/TxRep.pm?revision=1896315&view=markup#l1504
etc

With this in mind and the recent adjustment to trunk, could you retest your
situation? Feel free to come back with more information to help pinpoint the
issue if these updates don't help.


[Bug 7943] TxRep gives nonsensical scores?

Posted by bu...@spamassassin.apache.org.

https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7943

Matija Nalis <mn...@voyager.hr> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |mnalis-sabug@voyager.hr


[Bug 7943] TxRep gives nonsensical scores?

Posted by bu...@spamassassin.apache.org.

https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7943

Sidney Markowitz <si...@sidney.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|Undefined                   |4.0.1
                 CC|                            |sidney@sidney.com


[Bug 7943] TxRep gives nonsensical scores?

Posted by bu...@spamassassin.apache.org.

https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7943

Giovanni Bechis <gi...@paclan.it> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|---                         |FIXED

--- Comment #5 from Giovanni Bechis <gi...@paclan.it> ---
Sending        lib/Mail/SpamAssassin/Plugin/TxRep.pm
Transmitting file data .done
Committing transaction...
Committed revision 1909608.


[Bug 7943] TxRep gives nonsensical scores?

Posted by bu...@spamassassin.apache.org.

https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7943

Giovanni Bechis <gi...@paclan.it> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |giovanni@paclan.it

--- Comment #4 from Giovanni Bechis <gi...@paclan.it> ---
Created attachment 5883
  --> https://bz.apache.org/SpamAssassin/attachment.cgi?id=5883&action=edit
Possible fix

Delta formula is:
$delta = ($self->total() + $msgscore) / (1 + $self->count()) - $msgscore;

If we consider the case where:
- the TxRep database has 15 matching emails ($self->count() = 15)
- the spam message has a score of 40 (spam)
- the calculated TxRep score is 20 (spam)
- the new TxRep score will be (20 + 40) / ( 1 + 15 ) - 40 = -36.25
In this case the spam message will have a total score of 40 - 36.25 = 3.75 and
it won't be flagged as spam.

The attached patch doesn't consider those messages in the delta calculation.
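
The worked example above can be checked with a few lines of Python. This is
only an illustrative sketch: txrep_delta and worked_example are hypothetical
names, the inputs are the ones listed in the comment, and the threshold of 5.0
is assumed to be the default required_score.

```python
def txrep_delta(total, count, msgscore):
    """Mirror of the quoted Perl delta formula (hypothetical helper)."""
    return (total + msgscore) / (1 + count) - msgscore


def worked_example(total=20.0, count=15, msgscore=40.0, required_score=5.0):
    """Return (delta, final score, flagged?) for the numbers in the comment."""
    delta = txrep_delta(total, count, msgscore)
    final_score = msgscore + delta
    # required_score=5.0 is an assumption (the stock default threshold).
    return delta, final_score, final_score >= required_score


if __name__ == "__main__":
    delta, final_score, flagged = worked_example()
    print(f"delta={delta} final={final_score} flagged_as_spam={flagged}")
```

With these inputs the delta pulls a message scored at 40 down below the assumed
threshold, which is the failure the comment describes.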


[Bug 7943] TxRep gives nonsensical scores?

Posted by bu...@spamassassin.apache.org.

https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7943

--- Comment #3 from Matija Nalis <mn...@voyager.hr> ---
Thanks Paul for your efforts! 

Unfortunately, I haven't had a chance to try your fix yet, as I had to drop
TxRep in favor of AWL in early 2022 in order to make production functional
again, and I haven't had time to test it and bring it back...

However, since AWL with the SQL backend also seems buggy, and I'll have to
invest time to rebuild the database anyway, I think I might give TxRep another
try. It might be worth doing before 4.0 gets out, in order to iron out bugs
there and save other people some headaches.

However, I've found another bug in SQLBasedAddrList.pm which might be affecting
not only AWL but TxRep as well:
https://bz.apache.org/SpamAssassin/show_bug.cgi?id=8072

Could you take a look there if that would affect TxRep as well?

Also, can I just grab SQLBasedAddrList.pm and TxRep.pm from trunk, or do I have
to go full-trunk (which would be much harder to swallow, as I can basically
test it only by deploying it in production)?


[Bug 7943] TxRep gives nonsensical scores?

Posted by bu...@spamassassin.apache.org.

https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7943

--- Comment #1 from Matija Nalis <mn...@voyager.hr> ---
One observation: it seems that "totscore" is not always updated while
"msgcount" is. Should it be?
Because if it were updated at the same rate, that formula *would* keep delta at
zero, e.g.:

dbg: TxRep: mn EMAILIP _formula delta = (total()=0.000 + msgscore=-10.210) / (1 + count()=0.000) - msgscore=-10.210 = 0.000
dbg: TxRep: mn EMAILIP _formula delta = (total()=-10.210 + msgscore=-10.210) / (1 + count()=1.000) - msgscore=-10.210 = 0.000
dbg: TxRep: mn EMAILIP _formula delta = (total()=-20.420 + msgscore=-10.210) / (1 + count()=2.000) - msgscore=-10.210 = 0.000
dbg: TxRep: mn EMAILIP _formula delta = (total()=-30.630 + msgscore=-10.210) / (1 + count()=3.000) - msgscore=-10.210 = 0.000
dbg: TxRep: mn EMAILIP _formula delta = (total()=-40.840 + msgscore=-10.210) / (1 + count()=4.000) - msgscore=-10.210 = 0.000
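
The hypothetical sequence above follows directly from the formula: if total()
equals count() times a constant message score s, then
(count*s + s) / (1 + count) - s = 0. A short Python sketch of that case is
given here; the helper names are made up for illustration, and the store
update (add the score, bump the count by one) is an assumption about the
intended behaviour, not a description of what TxRep does.

```python
def txrep_delta(total, count, msgscore):
    """Mirror of the quoted Perl delta formula (hypothetical helper)."""
    return (total + msgscore) / (1 + count) - msgscore


def simulate_consistent_store(msgscore, runs):
    """Assumption: every run adds msgscore to the total and 1 to the count."""
    rows = []
    total, count = 0.0, 0
    for _ in range(runs):
        delta = txrep_delta(total, count, msgscore)
        rows.append((round(total, 3), count, delta))
        total += msgscore
        count += 1
    return rows


if __name__ == "__main__":
    for total, count, delta in simulate_consistent_store(-10.21, 5):
        # abs() avoids printing a negative zero from floating-point rounding.
        print(f"total()={total:.3f} count()={count} delta={abs(delta):.3f}")
```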


I've seen in the code that calling add_score() is sometimes tied to the
(non-default) "txrep_autolearn 1" setting. Enabling autolearn does indeed make
"totscore" change, but also in a wrong way, and "msgcount" gets increased by 2
instead of by 1. Even with autolearn enabled, the miscalculation that turns ham
into spam is still there:

+----------+---------------+------+----------+----------+----------+---------------------+
| username | email         | ip   | msgcount | totscore | signedby | last_hit            |
+----------+---------------+------+----------+----------+----------+---------------------+
| amavis   | hepi@hep.hr   | none |        2 |   -30.21 | spf      | 2021-11-12 04:41:52 |
| amavis   | hepi@hep.hr   | none |        4 | -23.4033 | spf      | 2021-11-12 04:43:22 |
| amavis   | hepi@hep.hr   | none |        6 |  -22.042 | spf      | 2021-11-12 04:43:58 |
| amavis   | hepi@hep.hr   | none |        8 | -21.4586 | spf      | 2021-11-12 04:44:30 |
| amavis   | hepi@hep.hr   | none |       10 | -21.1344 | spf      | 2021-11-12 04:44:59 |
+----------+---------------+------+----------+----------+----------+---------------------+

dbg: TxRep: mn EMAILIP _formula delta = (total()=0.000 + msgscore=-10.210) / (1 + count()=0.000) - msgscore=-10.210 = 0.000
dbg: TxRep: mn EMAILIP _formula delta = (total()=-30.210 + msgscore=-10.210) / (1 + count()=2.000) - msgscore=-10.210 = -3.263
dbg: TxRep: mn EMAILIP _formula delta = (total()=-23.403 + msgscore=-10.210) / (1 + count()=4.000) - msgscore=-10.210 = 3.487
dbg: TxRep: mn EMAILIP _formula delta = (total()=-22.042 + msgscore=-10.210) / (1 + count()=6.000) - msgscore=-10.210 = 5.603
dbg: TxRep: mn EMAILIP _formula delta = (total()=-21.459 + msgscore=-10.210) / (1 + count()=8.000) - msgscore=-10.210 = 6.691

 3.3 TXREP                  TXREP: Score normalizing based on sender's reputation
