
Posted to dev@spamassassin.apache.org by bu...@bugzilla.spamassassin.org on 2019/01/01 21:56:00 UTC

[Bug 7674] New: sa-learn learns all messages as ham even if --spam is specified

https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7674

            Bug ID: 7674
           Summary: sa-learn learns all messages as ham even if --spam is
                    specified
           Product: Spamassassin
           Version: 3.4.2
          Hardware: PC
                OS: Linux
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Learner
          Assignee: dev@spamassassin.apache.org
          Reporter: ralfglauberman@gmx.de
  Target Milestone: Undefined

While learning messages with "sa-learn --spam" from a folder, the messages are
in fact learned as ham instead of spam.

Debug log:
Jan  1 22:42:12.185 [19522] dbg: bayes: learner_new
self=Mail::SpamAssassin::Plugin::Bayes=HASH(0x56193c244a38),
bayes_store_module=Mail::SpamAssassin::BayesStore::SQL
Jan  1 22:42:12.204 [19522] dbg: bayes: using username: XXX
Jan  1 22:42:12.204 [19522] dbg: bayes: learner_new: got
store=Mail::SpamAssassin::BayesStore::SQL=HASH(0x56193cd9a998)
Jan  1 22:42:12.217 [19522] dbg: bayes: database connection established
Jan  1 22:42:12.218 [19522] dbg: bayes: found bayes db version 3
Jan  1 22:42:12.218 [19522] dbg: bayes: Using userid: 4
Jan  1 22:42:12.219 [19522] dbg: bayes: not available for scanning, only 0
spam(s) in bayes DB < 200
Jan  1 22:42:12.221 [19522] dbg: sa-learn: spamtest initialized
Jan  1 22:42:12.221 [19522] dbg: learn: initializing learner
Jan  1 22:42:12.221 [19522] dbg: bayes: bayes journal sync starting
Jan  1 22:42:12.221 [19522] dbg: bayes: bayes journal sync completed
Jan  1 22:42:12.221 [19522] dbg: bayes: expiry starting
Jan  1 22:42:12.222 [19522] dbg: bayes: database connection established
Jan  1 22:42:12.222 [19522] dbg: bayes: found bayes db version 3
Jan  1 22:42:12.223 [19522] dbg: bayes: Using userid: 4
Jan  1 22:42:12.234 [19522] dbg: bayes: DB expiry: tokens in DB: 430, Expiry
max size: 150000, Oldest atime: 1546240630, Newest atime: 1546309979, Last
expire: 0, Current time: 1546378932
Jan  1 22:42:12.236 [19522] dbg: bayes: expiry completed
Jan  1 22:42:12.238 [19522] dbg: learn: learning ham
Jan  1 22:42:12.258 [19522] dbg: bayes: tokenized body: 3 tokens
Jan  1 22:42:12.258 [19522] dbg: bayes: tokenized uri: 0 tokens
Jan  1 22:42:12.258 [19522] dbg: bayes: tokenized invisible: 0 tokens
Jan  1 22:42:12.261 [19522] dbg: bayes: tokenized header: 159 tokens
Jan  1 22:42:12.355 [19522] dbg: bayes: seen
(6fbb589c1d2d27cf8a150d8345ff08c53ec827fa@sa_generated) put
Jan  1 22:42:12.356 [19522] dbg: bayes: learned
'6fbb589c1d2d27cf8a150d8345ff08c53ec827fa@sa_generated', atime: 1546309979

Note the line "dbg: learn: learning ham"

The nham/nspam counts from "sa-learn --dump magic" confirm that the message is
in fact learned as ham and not as spam as intended.

Messages seem to be learned correctly if learned via autolearn instead of
the sa-learn script.

Bayes data is stored in a MySQL database backend, in case this is relevant.

System is Gentoo Linux x86_64 with the latest distribution SpamAssassin package
(spamassassin-3.4.2-r2).

-- 
You are receiving this mail because:
You are the assignee for the bug.

[Bug 7674] sa-learn learns all messages as ham even if --spam is specified

Posted by bu...@bugzilla.spamassassin.org.

https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7674

Ralf Glauberman <ra...@gmx.de> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |ralfglauberman@gmx.de

--- Comment #2 from Ralf Glauberman <ra...@gmx.de> ---
Thanks for your response. I have looked into it a bit more, and it seems the
learner is fine but the command-line parsing does not behave as I expected:

sa-learn test.eml --debug
--> Error reported (since no command is specified)

sa-learn --spam test.eml --debug
--> Mail is correctly learned as spam

sa-learn test.eml --spam --debug
--> Mail is learned as ham and no error is reported

I assumed that the order of parameters would be unimportant and I think it is a
bit confusing that the parameter is on one hand detected (no error is reported)
but on the other hand ignored. Maybe a simple sanity check could be added?


[Bug 7674] sa-learn learns all messages as ham even if --spam is specified

Posted by bu...@spamassassin.apache.org.

https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7674

Sidney Markowitz <si...@sidney.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|Undefined                   |4.0.0
         Resolution|---                         |FIXED
             Status|NEW                         |RESOLVED

--- Comment #7 from Sidney Markowitz <si...@sidney.com> ---
Committed revision 1899917.

"sa-learn spampath --spam" now produces a usage error instead of running with
unexpected results.

Also improved usage and perldoc documentation, including the use of -D as
mentioned in bug 7675.


[Bug 7674] sa-learn learns all messages as ham even if --spam is specified

Posted by bu...@bugzilla.spamassassin.org.

https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7674

--- Comment #4 from Ralf Glauberman <ra...@gmx.de> ---
Sorry for the delay, needed to debug it in more detail...

I still don't know why you are unable to reproduce the bug, but I am sure it is
not related to the debug switch. I was able to reproduce the problem with a
clean install and default configuration (i.e. no MySQL or anything). To debug
the problem, I added the following line to sa-learn in the wanted function (at
about line 576):

+  warn "learning $id as $class\n";
   my $status = $spamtest->learn( $ma, undef, $spam, $forget );

Executing the following command then returns:

./bin/sa-learn --spam test.eml --ham test2.eml --spam test3.eml
learning test.eml as s
learning test3.eml as s
learning test2.eml as h
Learned tokens from 2 message(s) (3 message(s) examined)

=> Works as intended

In order to be able to learn spam and ham messages during one execution of the
command, the command line is parsed and each message to learn is added to the
targets array by the target function. The function uses the global isspam
variable to determine if the message should be learned as spam or as ham. This
variable is set whenever a --spam/--ham command line parameter is read by
GetOptions. This means, however, that if a message file name is found on the
command line before any --spam/--ham flag, the isspam variable has not been
set by the time target is called. Perl treats the unset value as false, so the
statement "my $class = ( $isspam ? "spam" : "ham" );" results in the message
being learned as ham.

./bin/sa-learn test.eml --ham test2.eml --spam test3.eml
learning test3.eml as s
learning test.eml as h
learning test2.eml as h
Learned tokens from 2 message(s) (3 message(s) examined)

Or with just one message:
./bin/sa-learn test.eml --spam
learning test.eml as h
Learned tokens from 0 message(s) (1 message(s) examined)
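
The order dependence described in this comment can be modeled in a few lines.
The following is a minimal Python sketch, not sa-learn's actual Perl code: the
function name classify_targets and the simple left-to-right loop are
assumptions made for illustration, and only the --spam and --ham flags are
modeled.

```python
# Simplified model of the order-dependent parsing described in this comment.
# NOT sa-learn's code: classify_targets is a hypothetical name, and real
# option parsing is done in Perl by GetOptions.

def classify_targets(argv):
    """Pair each file name with the class that the most recent flag selects."""
    isspam = None  # stays unset until --spam or --ham is seen
    targets = []
    for arg in argv:
        if arg == "--spam":
            isspam = True
        elif arg == "--ham":
            isspam = False
        else:
            # Mirrors: my $class = ( $isspam ? "spam" : "ham" );
            # An unset flag is falsy, so the file silently becomes ham.
            targets.append((arg, "spam" if isspam else "ham"))
    return targets


# Flag before each file: classified as requested.
print(classify_targets(["--spam", "test.eml", "--ham", "test2.eml",
                        "--spam", "test3.eml"]))

# File before any flag: falls through to ham, and no error is raised.
print(classify_targets(["test.eml", "--spam"]))
```

In this model the trailing --spam is accepted but has nothing left to apply to,
which matches the "detected but ignored" behaviour reported earlier.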

I think the program should check that --spam/--ham has been seen on the command
line before any message file name and the documentation should be updated so it
is clear that both spam and ham can be learned during a single execution but
the relevant flag has to be used before the file name.
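
A minimal sketch of such a check, written as a Python model rather than as a
patch to the Perl script: the names UsageError and classify_targets_strict are
hypothetical, and the error text is invented for illustration.

```python
# Model of the proposed sanity check: refuse a file name that appears before
# any --spam/--ham flag. Hypothetical names; not the actual sa-learn fix.

class UsageError(Exception):
    """Raised when a file name is given before any --spam/--ham flag."""


def classify_targets_strict(argv):
    isspam = None  # None means "no class flag seen yet"
    targets = []
    for arg in argv:
        if arg == "--spam":
            isspam = True
        elif arg == "--ham":
            isspam = False
        elif isspam is None:
            # The file would otherwise be learned as ham by accident.
            raise UsageError(f"--spam or --ham must precede {arg!r}")
        else:
            targets.append((arg, "spam" if isspam else "ham"))
    return targets


print(classify_targets_strict(["--spam", "test.eml"]))

try:
    classify_targets_strict(["test.eml", "--spam"])
except UsageError as err:
    print("usage error:", err)
```

Comparing against None instead of testing truthiness is what separates "no
flag given yet" from an explicit --ham.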

Sorry for not providing a patch, but I don't know Perl (read only).


[Bug 7674] sa-learn learns all messages as ham even if --spam is specified

Posted by bu...@spamassassin.apache.org.

https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7674

Benny Pedersen <me...@junc.eu> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |me@junc.eu

--- Comment #8 from Benny Pedersen <me...@junc.eu> ---
Also, --forget will remove a digest that was learned by --ham/--spam.

Don't know why the options would be missed on parse :/


[Bug 7674] sa-learn learns all messages as ham even if --spam is specified

Posted by bu...@bugzilla.spamassassin.org.

https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7674

Bill Cole <bi...@apache.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Hardware|PC                          |All
                 OS|Linux                       |All

--- Comment #3 from Bill Cole <bi...@apache.org> ---
I still can't reproduce the exact problem with the given command line, so that
is apparently an artifact of the storage backend, the configuration, or the
input. It may be helpful to take this problem to the SpamAssassin Users mailing
list, where others with a diverse range of configurations can assist. 

HOWEVER: I will not (yet) unilaterally close this bug as "WORKSFORME" because
even though I have been unable to reproduce the behavior, I strongly suspect
that it is related to bug 7675, an issue that people have been working around
practically forever rather than properly documenting and/or fixing. 

*** WORKAROUNDS *** 

Always give the -D/--debug option an explicit set of debug channels: either
'all' or a comma-delimited list.


[Bug 7674] sa-learn learns all messages as ham even if --spam is specified

Posted by bu...@bugzilla.spamassassin.org.

https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7674

Bill Cole <bi...@apache.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |billcole@apache.org

--- Comment #1 from Bill Cole <bi...@apache.org> ---
I cannot reproduce this with flat-file (BDB 5.3) Bayes. 

What is the exact command line you are using to run sa-learn and generate that
log?


[Bug 7674] sa-learn learns all messages as ham even if --spam is specified

Posted by bu...@spamassassin.apache.org.

https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7674

Sidney Markowitz <si...@sidney.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |sidney@sidney.com

--- Comment #5 from Sidney Markowitz <si...@sidney.com> ---
I think that documenting that --ham is the default and that the --ham and
--spam options apply to any files following on the command line and can appear
multiple times on the command line would be a sufficient fix. I don't see a
need for making it a required option with no default.


[Bug 7674] sa-learn learns all messages as ham even if --spam is specified

Posted by bu...@spamassassin.apache.org.

https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7674

--- Comment #6 from Sidney Markowitz <si...@sidney.com> ---
After experimenting, I've changed my mind. sa-learn errors out if you give it a
file name on the command line without specifying either --ham or --spam so it
is not correct to say that --ham is the default as if it is optional. However,
the command "sa-learn test.eml --spam" would learn the file as ham. That makes
it too counterintuitive to leave as is. I now think that it should be an error
when the flag is undefined and the documentation should be explicit that --ham
and --spam affect the files named after them in the command line.


[Bug 7674] sa-learn learns all messages as ham even if --spam is specified

Posted by bu...@spamassassin.apache.org.

https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7674

Henrik Krohns <ap...@hege.li> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |apache@hege.li

--- Comment #9 from Henrik Krohns <ap...@hege.li> ---
All the same stuff was needed for --forget:

Sending        trunk/sa-learn.raw
Sending        trunk/t/bayesbdb.t
Sending        trunk/t/bayesdbm.t
Sending        trunk/t/bayessql.t
Transmitting file data ....done
Committing transaction...
Committed revision 1900137.
