Posted to users@spamassassin.apache.org by Jerry Malcolm <te...@malcolms.com> on 2019/09/28 17:20:10 UTC

Re: Setting Threshold (Resolved)

On 9/28/2019 9:38 AM, Matus UHLAR - fantomas wrote:
>>> On 28 Sep 2019, at 0:24, Jerry Malcolm wrote:
>>>> Understood.  I'm definitely stopping and starting the spamd 
>>>> service. (Although it's called the spamassassin service, it is 
>>>> definitely starting and stopping spamd.)
>>>>
>>>> I've done a ton of digging around.  I located:
>>>>
>>>> /usr/lib/systemd/system/spamassassin.service that starts 
>>>> /usr/bin/spamd using options file /etc/sysconfig/spamassassin and 
>>>> writes the log to /var/log/maillog.
>>>>
>>>> In the maillog it says it is loading options from 
>>>> /var/lib/spamassassin/3.004000/updates_spamassassin_org/local.cf
>>>>
>>>> I checked, and that file has required_score 4.0.  Yet the rest of 
>>>> the log file shows scores of x.x/5.0.
>>>>
>>>> So I tried adding an option --cf=required_score 4.0 to the options 
>>>> file.  No change.
>>>>
>>>> Then I tried adding it directly to the spamd invocation in the service 
>>>> file.  No matter how many places I tell it I want 4.0. Something is 
>>>> still overriding it to 5.0.  Any other places you can think of that 
>>>> I can look?
>
>> On 9/27/2019 11:49 PM, Bill Cole wrote:
>>> What are the full command line options for spamd?
>>>
>>> 'ps aux |grep spamd' should tell you the ground truth.
>
> On 28.09.19 00:21, Jerry Malcolm wrote:
>> With my extra parameter added....
>>
>> /usr/bin/perl -T -w /usr/bin/spamd --pidfile /var/run/spamd.pid -D -d 
>> -c -m5 -H --cf=required_score 4.0
>
> the "required_score 4.0" should be enclosed in quotes or apostrophes.
> Or put it in the config file.
>
> further, the empty -H changes how configs are used:
>
>    "By specifying no argument, spamd will use the spamc caller's home
>    directory instead."
>
> so, the calling user $HOME/.spamassassin/user_prefs is used

Matus,

Apparently, the whole problem was the quotes.  I added the quotes to the 
command line options, and it finally works.  I didn't try adding quotes 
in the local.cf file.  But it makes sense.  Note though, that the 
commented "required_score" line in the shipped version of local.cf does 
not have quotes.  Perhaps quotes should get added to that file in the 
distribution if they are required.

So now at least I know how to set the threshold.  But my original 
question has spawned a separate discussion of whether changing the 
threshold is the right thing to do.  I got one suggestion that, rather 
than reducing the threshold, I go in and rework the scoring on all of 
the rules in order to get my scores for obvious spam to rank above 5.0.  
I appreciate all of the work and knowledge by the SA team and 
contributors that has gone into refining the scoring on all of the 
rules.  If I don't have enough background to correctly lower the 
threshold, I definitely don't have the background and experience (or 
time) to rework the scoring on a thousand rules.

So the real question is.... why are MY scores on spam apparently lower 
than the main population of SA users?  I gotta believe that most users 
are processing emails just fine with a 5.0 threshold and not getting 
tons of uncaught spam.  I have added KAM.cf.  But a large 
percentage of spam still gets scored between 4 and 5.  I understand that 
there are a billion different strains of spam and the spam that user X 
receives is different than the spam that user Y receives.  But my lower 
scores seem a bit too consistent for that to be the only problem.

Just curious whether you have a set of test cases with an expected spam 
score that I could run through my SA and compare, and maybe isolate what 
rules might not be firing for me.

This is going to be an ongoing research problem for me. Not a 
show-stopper today.  But I would like to understand my situation 
better.  I want to use SA as intended.

Thanks again,

Jerry


Re: Setting Threshold (Resolved)

Posted by Matus UHLAR - fantomas <uh...@fantomas.sk>.
>>On 28.09.19 00:21, Jerry Malcolm wrote:
>>>With my extra parameter added....
>>>
>>>/usr/bin/perl -T -w /usr/bin/spamd --pidfile /var/run/spamd.pid -D 
>>>-d -c -m5 -H --cf=required_score 4.0

>On 9/28/2019 9:38 AM, Matus UHLAR - fantomas wrote:
>>the "required_score 4.0" should be enclosed in quotes or apostrophes.
>>Or put it in the config file.
>>
>>further, the empty -H changes how configs are used:
>>
>>   "By specifying no argument, spamd will use the spamc caller's
>>   home directory instead."
>>
>>so, the calling user $HOME/.spamassassin/user_prefs is used

On 28.09.19 12:20, Jerry Malcolm wrote:
>Apparently, the whole problem was the quotes.  I added the quotes to 
>the command line options, and it finally works.  I didn't try adding 
>quotes in the local.cf file.  But it makes sense.  Note though, that 
>the commented "required_score" line in the shipped version of local.cf 
>does not have quotes.  Perhaps quotes should get added to that file in 
>the distribution if they are required.

No.

Quotes must be in the startup file, because "required_score 4.0" without
quotes on the command line is understood as two separate arguments, while
you need one argument.

It's different in the config file; quotes don't belong there.
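
The difference is easy to see with a small shell experiment. This is only a sketch: count_args is a hypothetical helper, not part of SpamAssassin, and it simply reports how many arguments the shell hands to a command.

```shell
# count_args is a hypothetical helper (not part of SpamAssassin):
# it prints the number of arguments it received.
count_args() {
    echo "$#"
}

# Unquoted: the shell splits at the space, so a program would receive
# "--cf=required_score" and "4.0" as two separate arguments.
count_args --cf=required_score 4.0      # prints 2

# Quoted: the whole setting arrives as a single argument.
count_args --cf='required_score 4.0'    # prints 1
count_args --cf="required_score 4.0"    # prints 1
```

In local.cf the same setting is written as the plain line "required_score 4.0" with no quotes, because no shell is involved in parsing that file.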

>So now at least I know how to set the threshold.  But my original 
>question has spawned a separate discussion of whether changing the 
>threshold is the right thing to do.  I got one suggestion that, rather 
>than reducing the threshold, I go in and rework the scoring on all of 
>the rules in order to get my scores for obvious spam to rank above 
>5.0. 

No.
Playing with scores is often even worse, because scores are balanced
automatically and increasing any of them could increase false positives.

First you should ask why you only get those scores.
There are plugins like razor2, pyzor, DCC, that can increase scores
dramatically.

Also, using a Bayes database helps a lot, although it requires training.

Since you use the -H parameter above, your users will each have their own
database and will need to train it themselves.

>So the real question is.... why are MY scores on spam apparently lower 
>than the main population of SA users? 

There are always some false negatives. Spammers try hard.

-- 
Matus UHLAR - fantomas, uhlar@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
I just got lost in thought. It was unfamiliar territory.

Re: Setting Threshold (Resolved)

Posted by Bill Cole <sa...@billmail.scconsult.com>.
On 28 Sep 2019, at 13:20, Jerry Malcolm wrote:

> On 9/28/2019 9:38 AM, Matus UHLAR - fantomas wrote:
>>>> On 28 Sep 2019, at 0:24, Jerry Malcolm wrote:
>>>>> Understood.  I'm definitely stopping and starting the spamd 
>>>>> service. (Although it's called the spamassassin service, it is 
>>>>> definitely starting and stopping spamd.)
>>>>>
>>>>> I've done a ton of digging around.  I located:
>>>>>
>>>>> /usr/lib/systemd/system/spamassassin.service that starts 
>>>>> /usr/bin/spamd using options file /etc/sysconfig/spamassassin and 
>>>>> writes the log to /var/log/maillog.
>>>>>
>>>>> In the maillog it says it is loading options from 
>>>>> /var/lib/spamassassin/3.004000/updates_spamassassin_org/local.cf
>>>>>
>>>>> I checked, and that file has required_score 4.0.  Yet the rest of 
>>>>> the log file shows scores of x.x/5.0.
>>>>>
>>>>> So I tried adding an option --cf=required_score 4.0 to the options 
>>>>> file.  No change.
>>>>>
>>>>> Then I tried adding it directly to the spamd invocation in the 
>>>>> service file.  No matter how many places I tell it I want 4.0. 
>>>>> Something is still overriding it to 5.0.  Any other places you 
>>>>> can think of that I can look?
>>
>>> On 9/27/2019 11:49 PM, Bill Cole wrote:
>>>> What are the full command line options for spamd?
>>>>
>>>> 'ps aux |grep spamd' should tell you the ground truth.
>>
>> On 28.09.19 00:21, Jerry Malcolm wrote:
>>> With my extra parameter added....
>>>
>>> /usr/bin/perl -T -w /usr/bin/spamd --pidfile /var/run/spamd.pid -D 
>>> -d -c -m5 -H --cf=required_score 4.0
>>
>> the "required_score 4.0" should be enclosed in quotes or 
>> apostrophes.
>> Or put it in the config file.
>>
>> further, the empty -H changes how configs are used:
>>
>>    "By specifying no argument, spamd will use the spamc caller's 
>>    home directory instead."
>>
>> so, the calling user $HOME/.spamassassin/user_prefs is used
>
> Matus,
>
> Apparently, the whole problem was the quotes.  I added the quotes to 
> the command line options, and it finally works.  I didn't try adding 
> quotes in the local.cf file.  But it makes sense.  Note though, that 
> the commented "required_score" line in the shipped version of local.cf 
> does not have quotes.  Perhaps quotes should get added to that file 
> in the distribution if they are required.

They are not required in a config file. They are only required on a 
command line.

> So now at least I know how to set the threshold. 

You've found one way, but there's still the puzzle of which config file 
is actually being used by spamd, since you changed the threshold in some 
file that was clearly NOT the operative local.cf.
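
One way to approach that puzzle is to ask SpamAssassin which files it reads. The sketch below assumes that the debug output of "spamassassin -D --lint" marks each loaded file with the text "config: read file"; the exact log format may differ between versions. list_config_files is a hypothetical helper, not part of SpamAssassin.

```shell
# list_config_files is a hypothetical helper (not part of SpamAssassin).
# It reads debug output on stdin and prints each distinct path that
# follows the text "config: read file" (assumed log format).
list_config_files() {
    sed -n 's/.*config: read file //p' | sort -u
}

# Example use, assuming spamassassin is installed. Debug output goes to
# stderr, so it is redirected into the pipe:
#   spamassassin -D --lint 2>&1 | list_config_files
```

This shows what the spamassassin command loads with its default paths. A spamd started with different path options could load something else, so the result is a hint rather than proof. If the local.cf that was edited, or KAM.cf, does not appear in the list, that is a strong sign the rules are being read from somewhere else.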

> But my original question has spawned a separate discussion of whether 
> changing the threshold is the right thing to do.  I got one 
> suggestion that, rather than reducing the threshold, I go in and 
> rework the scoring on all of the rules in order to get my scores for 
> obvious spam to rank above 5.0.  I appreciate all of the work and 
> knowledge by the SA team and contributors that has gone into refining 
> the scoring on all of the rules.  If I don't have enough background 
> to correctly lower the threshold, I definitely don't have the 
> background and experience (or time) to rework the scoring on a 
> thousand rules.

The default rules, scores, and threshold are not Holy Writ. There is an 
automated process backed by human classification of ham and spam corpora 
which calculates some rule scores with an assumption of 5 as the 
threshold, but I can guarantee that those corpora are not representative 
of all mail, of all mail seen by SA, or of all mail handled by any 
single system. It is almost certainly true that the SA defaults are not 
the best possible fit for any site anywhere; they're just the best 
compromise we know how to come up with. In creating rules and 
determining whether they are good enough to publish, we have a 
substantial bias against false positives, inevitably meaning that SA 
will have some false negatives.

Adjusting the threshold is definitely the easiest way to deal with SA 
making too many mistakes on one side of the threshold or the other. In 
my experience, 4.0 is a reasonable level AFTER you've got Bayes and AWL 
or TxRep databases trained.

> So the real question is.... why are MY scores on spam apparently lower 
> than the main population of SA users?  I gotta believe that most 
> users are processing emails just fine with a 5.0 threshold and not 
> getting tons of uncaught spam.  I have added KAM.cf. 

Are you sure that your spamd is actually using the KAM.cf rules? I ask 
because of the unresolved question of what config files it is actually 
using.

> But a large percentage of spam still gets scored between 4 and 5.  
> I understand that there are a billion different strains of spam and 
> the spam that user X receives is different than the spam that user Y 
> receives.  But my lower scores seem a bit too consistent for that to 
> be the only problem.

I've worked with a lot of different mail streams and I think it is 
absolutely normal for a site to have that sort of tilt, especially one 
with a small number of users.

> Just curious whether you have a set of test cases with an expected spam 
> score that I could run through my SA and compare, and maybe isolate 
> what rules might not be firing for me.

We do not publish test cases because there is really no hope of coming 
up with significant coverage in a reasonable number of test cases. The 
most common sources of excess false negatives are entirely local issues, 
such as *_networks values that are not set correctly or the lack of a 
proper independent DNS resolver, which you need in order to use the 
"free for most" DNSBL and URIBL services that block the heaviest users 
by resolver address.
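
As a rough illustration of those local settings, a local.cf fragment might look like the following. The addresses are placeholders taken from a documentation range, not recommendations, and the comment about a resolver on 127.0.0.1 is an assumption; the right values depend entirely on the site's own mail relays and resolver.

```
# local.cf fragment (illustrative values only)

# Relay hosts whose Received headers can be trusted
# (placeholder range; replace with the site's own relays).
trusted_networks 192.0.2.0/24

# The subset of trusted hosts that are this site's own mail servers
# (placeholder range).
internal_networks 192.0.2.0/24

# Send DNSBL and URIBL queries to a local recursive resolver instead of
# a large shared one (assumes a resolver is listening on 127.0.0.1).
dns_server 127.0.0.1
```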

It is fairly common for people with persistent false negative problems 
to ask about them here, usually posting the spam samples to PasteBin to 
avoid having messages to the list blocked as spam.

> This is going to be an ongoing research problem for me. Not a 
> show-stopper today.  But I would like to understand my situation 
> better.  I want to use SA as intended.

As a member of the SpamAssassin PMC I think that I'm safe in saying that 
the only "as intended" use is "whatever works for your particular 
circumstances."

-- 
Bill Cole
bill@scconsult.com or billcole@apache.org
(AKA @grumpybozo and many *@billmail.scconsult.com addresses)