Posted to users@spamassassin.apache.org by Reindl Harald <h....@thelounge.net> on 2016/05/13 13:42:07 UTC

FSL_HELO_HOME: deep headers again

WTF - Received: from daves-air.home ([1.125.7.92]) is once again a 
DEEP HEADER inspection - what about scoring badly thought-out rules, 
which are not even worth a description, no higher than 0.5?

3.7 FSL_HELO_HOME          No description available
score FSL_HELO_HOME        2.641 3.722 2.641 3.722

AND YES IT WAS A FALSE-POSITIVE
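For context: the four numbers on the score line are SpamAssassin's four 
score sets (no Bayes/no network tests, network only, Bayes only, Bayes 
plus network), and a "deep" rule is one that matches against every 
Received header in the chain instead of only the border relay. A minimal 
sketch of the difference in SpamAssassin rule syntax - the rule names and 
patterns below are hypothetical illustrations, not the actual 
FSL_HELO_HOME definition:

  # Hypothetical "deep" check: matches a .home hostname in *any*
  # Received header, including hops far behind the border MX
  # (e.g. a user's own laptop handing mail to its smarthost):
  header HELO_HOME_DEEP    Received =~ /^from \S+\.home[ (]/m

  # Hypothetical anchored variant: consults only the external relay
  # chain via the X-Spam-Relays-External pseudo-header, whose first
  # "[ ip=... rdns=... helo=... ]" entry is the border hop:
  header HELO_HOME_BORDER  X-Spam-Relays-External =~ /^\[ ip=\S+ rdns=\S* helo=\S+\.home /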



Re: FSL_HELO_HOME: deep headers again

Posted by Reindl Harald <h....@thelounge.net>.

On 13.05.2016 at 15:42, Reindl Harald wrote:
> WTF - Received: from daves-air.home ([1.125.7.92]) is once again a
> DEEP HEADER inspection - what about scoring badly thought-out rules,
> which are not even worth a description, no higher than 0.5?
>
> 3.7 FSL_HELO_HOME          No description available
> score FSL_HELO_HOME        2.641 3.722 2.641 3.722
>
> AND YES IT WAS A FALSE-POSITIVE

it looks like it was introduced with one of the few updates this month, 
and from May 12 06:21:46 until now it has hit 6 *100% ham messages* and 
not a single spam, while the last one was even rejected because of the 
3.7 points

again: auto-QA does *not* work, and who is this "FSL" that keeps writing 
deep-header rules without a sensible maximum score?

08-Mai-2016 01:38:23: SpamAssassin: No update available
09-Mai-2016 00:02:56: SpamAssassin: No update available
10-Mai-2016 01:10:29: SpamAssassin: No update available
11-Mai-2016 00:55:46: SpamAssassin: No update available
12-Mai-2016 00:21:17: SpamAssassin: Update processed successfully
13-Mai-2016 00:33:31: SpamAssassin: No update available
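For anyone hit by this in the meantime: the standard local mitigation is 
a score override in local.cf, which survives sa-update runs; a score of 
0 disables a rule entirely. A minimal example (0.5 is simply the cap 
argued for above):

  # /etc/mail/spamassassin/local.cf - restart spamd afterwards
  score FSL_HELO_HOME 0.5
  # or switch the rule off completely:
  # score FSL_HELO_HOME 0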


Re: FSL_HELO_HOME: deep headers again

Posted by John Hardin <jh...@impsec.org>.
On Thu, 26 May 2016, Reindl Harald wrote:

> On 13.05.2016 at 18:18, John Hardin wrote:
>>  On Fri, 13 May 2016, RW wrote:
>> 
>> >  On Fri, 13 May 2016 15:42:07 +0200
>> >  Reindl Harald wrote:
>> > 
>> > >  WTF - Received: from daves-air.home ([1.125.7.92]) is once again a
>> > >  DEEP HEADER inspection -
>> > 
>> >  This looks like a simple mistake rather than a deliberate attempt at a
>> >  deep check. You should file a bug report.
>>
>>  Please don't. The rule has been disabled
>
> has it?

At the time I wrote that I'd looked at the sandbox file and the rule had 
been commented out.

http://svn.apache.org/viewvc/spamassassin/trunk/rulesrc/sandbox/maddoc/99_doc_test.cf?r1=1726846&r2=1743683&sortby=date&diff_format=h

Checking SVN shows it has not been reenabled.

http://svn.apache.org/viewvc/spamassassin/trunk/rulesrc/sandbox/maddoc/99_doc_test.cf?diff_format=h&sortby=date&view=log

There hasn't been a rules update since that change - the last update 
covers through revision 1743621, just before that change. It looks like 
the corpora are large enough; it seems to be a timing issue now.

A bug report about this rule would either modify or disable the rule, 
which has already been done, but wouldn't cause an update to be delivered 
any sooner than the masscheck corpora allow.

-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
   Public Education: the bureaucratic process of replacing
   an empty mind with a closed one.                          -- Thorax
-----------------------------------------------------------------------
  4 days until Memorial Day - honor those who sacrificed for our liberty

Re: FSL_HELO_HOME: deep headers again

Posted by Reindl Harald <h....@thelounge.net>.

On 13.05.2016 at 18:18, John Hardin wrote:
> On Fri, 13 May 2016, RW wrote:
>
>> On Fri, 13 May 2016 15:42:07 +0200
>> Reindl Harald wrote:
>>
>>> WTF - Received: from daves-air.home ([1.125.7.92]) is once again a
>>> DEEP HEADER inspection -
>>
>> This looks like a simple mistake rather than a deliberate attempt at a
>> deep check. You should file a bug report.
>
> Please don't. The rule has been disabled

has it?

May 24 14:57:15 mail-gw spamd[17055]: spamd: result: . -3 - 
BAYES_00,CUST_DNSWL_7_ORG_LOW,CUST_DNSWL_8_TL_NT,FSL_HELO_HOME,HTML_MESSAGE,SPF_NONE 
scantime=2.3,size=39898,user=sa-milt,uid=189,required_score=5.5,rhost=localhost,raddr=127.0.0.1,rport=/run/spamassassin/spamassassin.sock,mid=<A3...@warga-hack.at>,bayes=0.000000,autolearn=disabled,shortcircuit=no
May 24 16:26:51 mail-gw spamd[21920]: spamd: result: . -1 - 
BAYES_20,CUST_DNSWL_2_SENDERSC_LOW,CUST_DNSWL_7_ORG_LOW,CUST_DNSWL_8_TL_NT,DKIM_SIGNED,DKIM_VALID,FSL_HELO_HOME,HTML_MESSAGE,RCVD_IN_MSPIKE_H3,RCVD_IN_MSPIKE_WL,SPF_NONE 
scantime=1.9,size=7282,user=sa-milt,uid=189,required_score=5.5,rhost=localhost,raddr=127.0.0.1,rport=/run/spamassassin/spamassassin.sock,mid=<87...@franzwine.com>,bayes=0.176735,autolearn=disabled,shortcircuit=no
May 24 18:55:13 mail-gw spamd[2470]: spamd: result: . -3 - 
BAYES_00,CUST_BODY_BEGINS_VL,CUST_DNSWL_8_TL_NT,FSL_HELO_HOME,HTML_MESSAGE,SPF_NONE 
scantime=3.8,size=168647,user=sa-milt,uid=189,required_score=5.5,rhost=localhost,raddr=127.0.0.1,rport=/run/spamassassin/spamassassin.sock,mid=<00...@saeco-professional.pl>,bayes=0.000000,autolearn=disabled,shortcircuit=no
May 24 23:37:48 mail-gw spamd[11907]: spamd: result: . 2 - 
BAYES_00,CUST_DNSBL_20_SORBS_SPAM,CUST_DNSBL_30_SENDERSC_MED,CUST_DNSBL_34_BACKSCATTER,CUST_DNSWL_7_ORG_LOW,CUST_DNSWL_8_TL_NT,DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,FREEMAIL_FROM,FSL_HELO_HOME,HTML_IMAGE_RATIO_02,HTML_MESSAGE,MISSING_HEADERS,RCVD_IN_MSPIKE_H3,RCVD_IN_MSPIKE_WL,SPF_PASS 
scantime=2.5,size=955522,user=sa-milt,uid=189,required_score=5.5,rhost=localhost,raddr=127.0.0.1,rport=/run/spamassassin/spamassassin.sock,mid=<5A...@gmail.com>,bayes=0.000000,autolearn=disabled,shortcircuit=no
May 25 07:03:18 mail-gw spamd[833]: spamd: result: . -5 - 
BAYES_00,CUST_DNSWL_12_TL_MED,CUST_DNSWL_2_SENDERSC_LOW,DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,FSL_HELO_HOME,RP_MATCHES_RCVD,SPF_PASS 
scantime=2.6,size=2243,user=sa-milt,uid=189,required_score=5.5,rhost=localhost,raddr=127.0.0.1,rport=/run/spamassassin/spamassassin.sock,mid=<2A...@me.com>,bayes=0.000000,autolearn=disabled,shortcircuit=no
May 25 11:46:26 mail-gw spamd[7557]: spamd: result: . -1 - 
BAYES_00,CUST_DNSBL_17_SPAMCANNIBAL,CUST_DNSBL_26_NSZONES,CUST_DNSBL_34_BACKSCATTER,CUST_DNSWL_7_ORG_LOW,CUST_DNSWL_8_TL_NT,FSL_HELO_HOME,HTML_MESSAGE,RCVD_IN_MSPIKE_H3,RCVD_IN_MSPIKE_WL,SPF_NONE 
scantime=3.7,size=20552,user=sa-milt,uid=189,required_score=5.5,rhost=localhost,raddr=127.0.0.1,rport=/run/spamassassin/spamassassin.sock,mid=<4D...@komma.cc>,bayes=0.000000,autolearn=disabled,shortcircuit=no
May 25 11:54:57 mail-gw spamd[25762]: spamd: result: . -3 - 
BAYES_00,CUST_DNSWL_8_TL_NT,FSL_HELO_HOME,HTML_MESSAGE,SPF_NONE,T_KAM_HTML_FONT_INVALID 
scantime=2.6,size=923615,user=sa-milt,uid=189,required_score=5.5,rhost=localhost,raddr=127.0.0.1,rport=/run/spamassassin/spamassassin.sock,mid=<7D...@intel>,bayes=0.000000,autolearn=disabled,shortcircuit=no
May 25 16:33:20 mail-gw spamd[26346]: spamd: result: . -3 - 
BAYES_00,CUST_DNSWL_7_ORG_LOW,CUST_DNSWL_8_TL_NT,FREEMAIL_FROM,FSL_HELO_HOME,HTML_MESSAGE,RCVD_IN_MSPIKE_H3,RCVD_IN_MSPIKE_WL,RP_MATCHES_RCVD,SPF_NONE 
scantime=5.8,size=74496,user=sa-milt,uid=189,required_score=5.5,rhost=localhost,raddr=127.0.0.1,rport=/run/spamassassin/spamassassin.sock,mid=<6A...@aon.at>,bayes=0.000000,autolearn=disabled,shortcircuit=no


Re: FSL_HELO_HOME: deep headers again

Posted by John Hardin <jh...@impsec.org>.
On Fri, 13 May 2016, RW wrote:

> On Fri, 13 May 2016 15:42:07 +0200
> Reindl Harald wrote:
>
>> WTF - Received: from daves-air.home ([1.125.7.92]) is once again a
>> DEEP HEADER inspection -
>
> This looks like a simple mistake rather than a deliberate attempt at a
> deep check. You should file a bug report.

Please don't. The rule has been disabled.

-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
   ...much of our country's counterterrorism security spending is not
   designed to protect us from the terrorists, but instead to protect
   our public officials from criticism when another attack occurs.
                                                     -- Bruce Schneier
-----------------------------------------------------------------------
  143 days since the first successful real return to launch site (SpaceX)

Re: FSL_HELO_HOME: deep headers again

Posted by Reindl Harald <h....@thelounge.net>.

On 13.05.2016 at 23:08, Tom Hendrikx wrote:
> On 13-05-16 18:29, Reindl Harald wrote:
>> especially, you would not get much from the bayes samples because they
>> would trigger all sorts of wrong rules after stripping most headers and
>> adding a generic Received header (which the bayes engine seems to need
>> for whatever reason, since it otherwise scores samples completely
>> differently)
>
> This is an assumption: you can't know what your data would contribute to
> the masscheck process

this is *not* an assumption - the setup is maintained in a way that I 
don't have to make many assumptions at all

I run tools that pass corpus files and downloads through SA, and I 
regularly see all sorts of rules hit on stripped samples which would not 
hit on the untouched email

guess what remains with a 2292-line "bayes_ignore_header" list, which is 
also used to strip messages with formail for comparison against the 
original ones

the reason is that we maintain a really huge Bayes database which is 
intended to contain only the body and a few headers; otherwise 90000 
samples would not take only 800 MB of storage and produce "only" 
2818486 tokens

why?

because we keep samples and Bayes data forever, while training every 
spam message below BAYES_99 and every ham message >= BAYES_50, to keep 
the option of rebuilding from scratch at any point in time (tokenizer 
changes in future versions, maybe multi-word tokens in future versions, 
or, if needed, a switch to a different solution without starting to 
collect from scratch)
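For readers unfamiliar with the directive: bayes_ignore_header is stock 
SpamAssassin configuration that keeps named headers out of Bayes 
tokenization, one directive per header, which is how such a list can 
grow to thousands of lines. An illustrative excerpt (the header names 
are just examples, not the poster's actual list):

  # keep filter-generated or site-specific headers out of the
  # Bayes tokenizer, one header name per directive:
  bayes_ignore_header X-Spam-Status
  bayes_ignore_header X-Spam-Report
  bayes_ignore_header X-Greylist
  bayes_ignore_header X-MimeOLE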


Re: FSL_HELO_HOME: deep headers again

Posted by Tom Hendrikx <to...@whyscream.net>.
On 13-05-16 18:29, Reindl Harald wrote:
> 
> On 13.05.2016 at 18:11, John Hardin wrote:
>> On Fri, 13 May 2016, Reindl Harald wrote:
>>
>>> the problem is pushing out such rules with such scores at all,
>>> combined with a non-working auto-QA (non-working as in: no correction
>>> for days, as well as dangerous scoring of new rules from the start)
>>>
>>> 02-Mai-2016 00:12:34: SpamAssassin: No update available
>>> 03-Mai-2016 01:55:05: SpamAssassin: No update available
>>> 04-Mai-2016 00:43:33: SpamAssassin: No update available
>>> 05-Mai-2016 01:48:15: SpamAssassin: Update processed successfully
>>> 06-Mai-2016 00:53:17: SpamAssassin: No update available
>>> 07-Mai-2016 01:21:23: SpamAssassin: No update available
>>> 08-Mai-2016 01:38:23: SpamAssassin: No update available
>>> 09-Mai-2016 00:02:56: SpamAssassin: No update available
>>> 10-Mai-2016 01:10:29: SpamAssassin: No update available
>>> 11-Mai-2016 00:55:46: SpamAssassin: No update available
>>> 12-Mai-2016 00:21:17: SpamAssassin: Update processed successfully
>>> 13-Mai-2016 00:33:31: SpamAssassin: No update available
>>
>> Perhaps you could help with that by participating in masscheck. You seem
>> to get a lot of FPs on base rules; contributing masscheck results on
>> your ham would reduce those
> 
> I can't rsync customer mails to a 3rd party

That is not necessary for masscheck.
> 
> if it were based on some webservice where you just feed in local
> samples and only give out the rules which hit and the spam/ham flag,
> it would somehow be possible

The process is clearly documented on the wiki:
https://wiki.apache.org/spamassassin/MassCheck
> 
> especially, you would not get much from the bayes samples because they
> would trigger all sorts of wrong rules after stripping most headers and
> adding a generic Received header (which the bayes engine seems to need
> for whatever reason, since it otherwise scores samples completely
> differently)

This is an assumption: you can't know what your data would contribute to
the masscheck process.
> 
> in any case: such a rule with 3.7 must not happen at all, even if it
> had no such bad impact - 3.7 is very high and only deserved when you
> are certain that a mail is spam, and that certainty is *never* backed
> by a single header, deep inspection or not
> 
That is true, but I think you should put your money where your mouth is:
just run the masscheck on your corpus and send the results to the devs
for inspection. If it's not working, you lost nothing. If the data *is*
useful, we all win from your work by getting better scores.

Just my 2 cents.
Regards,
	Tom
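For concreteness, the local run the wiki page describes boils down to 
checking out the rules tree from SVN and pointing the mass-check script 
at hand-classified corpora; a rough sketch (the corpus paths are 
placeholders, and the rsync upload step is detailed on the wiki):

  svn checkout http://svn.apache.org/repos/asf/spamassassin/trunk spamassassin
  cd spamassassin/masses
  ./mass-check --progress \
      ham:dir:/path/to/classified/ham \
      spam:dir:/path/to/classified/spam
  # results land in ham.log and spam.log; it is these rule-hit logs
  # that get uploaded, not the messages themselves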


Re: FSL_HELO_HOME: deep headers again

Posted by John Hardin <jh...@impsec.org>.
On Sat, 14 May 2016, Reindl Harald wrote:

> On 14.05.2016 at 19:10, John Hardin wrote:
>>  On Sat, 14 May 2016, Reindl Harald wrote:
>> 
>> >  On 14.05.2016 at 04:50, John Hardin wrote:
>> > >   On Sat, 14 May 2016, Reindl Harald wrote:
>> > > >   On 14.05.2016 at 04:04, John Hardin wrote:
>> > > > >    How would a webservice be better? That would still be
>> > > > >   sending customer emails to a third party for processing.
>> > > >
>> > > >  uhm you missed "and only give out the rules which hit and the
>> > > >  spam/ham flag"
>> > > 
>> > >   Ah, OK, I misunderstood what you were suggesting.
>> > > 
>> > >   That wouldn't work. That tells you the rules they hit at the time 
>> > >   they were scanned, not which rules they would hit from the 
>> > >   current testing rules.
>> > 
>> >  on the other hand it would reflect the complete mail-flow and not just
>> >  hand-crafted samples
>>
>>  It's not hand *crafted* samples, it's hand *classified* samples. The
>>  message needs to be classified by a reliable human as ham or spam for
>>  the analysis of the rules that it hits to have any use, or even be
>>  possible.
>
> that's just nitpicking - I could correct you the same way in German for
> most of what you would try to express :-)

Yes, probably.

>>  That's why doing something like having an SA install that's based on the
>>  current SVN sandbox rules, and that gets a forked copy of your mail
>>  stream, and that captures the hits, is still not useful for anything
>>  other than gross "this rule didn't hit anything" analysis - you don't
>>  know what a given message *should* have been, so you can't say anything
>>  about the rules that hit it - whether they aid that result, or hinder it.
>
> how do you imagine such a setup *in practice*?

Somewhat stream-of-consciousness:

In addition to your normal deliver-to-the-user MTA, have another MTA 
that is running against an SA that is configured from SVN. Note that 
this wouldn't be a backup MTA; it would have to get a copy of your 
inbound mail stream. I'm not sure how you'd fork the mail delivery 
process; that's probably MTA-dependent.

The masscheck MTA would deliver to SA, record the rule hits and 
classification in the masscheck upload format, and discard the message.

Normal delivery would usually be suspended so that messages queue.

When the masscheck start time is reached, update from SVN, recompile the 
rules, clear the log and enable MTA delivery. The queued messages would be 
scanned and recorded until the upload time is reached, at which time 
delivery is suspended again. This may or may not be long enough to clear 
the queue.

The results would then be uploaded.

As you noted, there would have to be some minimum score for recording the 
message as spam, and some maximum score for recording it as ham. Anything 
in between would have to be discarded as ambiguous. There might also need 
to be some kind of weighting on the results when they are incorporated 
into masscheck to reflect that they are not hand-classified and thus their 
reliability isn't as good as we'd like; however, there have been 
misclassifications in hand-classified corpora before, so if the 
thresholds are well-chosen that may not be an issue.

But note, this would probably not help offset a high-scoring FP rule as 
the message would be auto-classified as spam or, at best, ambiguous - it 
might actually be self-reinforcing and make the situation worse, rather 
than help it be self-correcting as hand-classified corpora would. It 
also probably won't help much with new rules.

I don't really think there's any way around having hand-classified clean 
and complete corpora for running masschecks.
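Condensed into a rough script, that cycle might look like the sketch 
below; suspend_delivery, resume_delivery and upload_results are 
hypothetical stand-ins for MTA-specific queue control and the masscheck 
upload, and every path is a placeholder:

  #!/bin/sh
  # hypothetical nightly cycle for an auto-classifying masscheck MTA
  suspend_delivery                     # let inbound copies queue up
  cd /srv/sa-masscheck/spamassassin
  svn update                           # pull the current sandbox rules
  sa-compile                           # recompile the rules
  : > /var/log/sa-masscheck/hits.log   # clear the previous day's log
  resume_delivery                      # queued mail is scanned and logged
  sleep 14400                          # ...until the upload deadline
  suspend_delivery                     # stop recording
  upload_results /var/log/sa-masscheck/hits.log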

>>  Unless your mail stream prior to SA is *guaranteed* 100% ham (which is
>>  hugely unlikely or why would you be running SA at all?) or 100% spam
>>  (which might be the case for a clean honeypot), you need to review and
>>  classify the messages manually before performing the scan and reporting
>>  the rule hits, and that means keeping copies of the pristine messages,
>>  at least for a while.
>>
>>  I don't know whether statutory requirements make this impossible for you
>>  even if you did obtain consent from some of your clients to use their
>>  mail stream in that manner.
>
> I don't have access to the whole mailflow to classify it, nor is there
> a technical way to mirror it on a different setup

OK

> nor would SA or even smtpd ever see 95% of the junk, because content
> filters are the last resort by definition

It's not too difficult for masscheck to get spam, as there are honeypots 
feeding masscheck. It's harder to get ham, especially non-English ham, so 
contributing to masscheck from a 99% clean feed is still helpful.

>> >  it should be constrained by a minimum negative score to count as
>> >  ham and a minimum positive score to count as spam - configurable,
>> >  because which scores are clear classifications depends on the local
>> >  environment and adjustments; 7.0 would not be 100% spam here, but
>> >  12.0 would be, for example
>>
>>  That's probably still not reliable enough for use in masscheck. Ham is a
>>  bit more important; what would you recommend as a lower limit for
>>  considering a message as ham? How many actual hams would meet that
>>  requirement? It might be a lot of work for little final benefit. What
>>  percentage actual FNs would you see with that setting? Those would
>>  damage the masscheck analysis.
>
> I would agree if we could call the current masscheck results reliable
>
>> >  it would at least help in the current situation, with a rule like
>> >  FSL_HELO_HOME that hits only clear ham yet has a high spam score; if
>> >  it only needed to be enabled, collected the information through
>> >  scanning, and submitted the results once per day, a lot of people
>> >  running milter-like setups with reject and no access to rejected
>> >  mails could help improve the auto-QA without collecting whole mails
>>
>>  Potentially. You'd have to be willing to set up a parallel mail
>>  processing stream using the current SVN sandbox rules as I described
>>  above. Performing analysis on the released rules provides no benefit to
>>  masscheck
>
> why would it provide no benefit, when one part of "sa-update" - which
> currently doesn't deliver any updates most of the time - is to re-score
> badly scored rules? that's really not only about sandbox rules

Because the rules in question may have changed since the last update was 
released. The analysis needs to be of the current state of the rules in 
SVN - take a snapshot, masscheck it and generate scores, and those rules 
and their scores are released as an update if the corpora are large enough 
for the results to be considered reliable. (Note that "reliability" is 
based on the *size* of the corpora. We sadly don't have any way to judge 
it based on broadness of content.)

-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
   ...much of our country's counterterrorism security spending is not
   designed to protect us from the terrorists, but instead to protect
   our public officials from criticism when another attack occurs.
                                                     -- Bruce Schneier
-----------------------------------------------------------------------
  144 days since the first successful real return to launch site (SpaceX)

Re: FSL_HELO_HOME: deep headers again

Posted by Reindl Harald <h....@thelounge.net>.

On 14.05.2016 at 19:10, John Hardin wrote:
> On Sat, 14 May 2016, Reindl Harald wrote:
>
>> On 14.05.2016 at 04:50, John Hardin wrote:
>>>  On Sat, 14 May 2016, Reindl Harald wrote:
>>> >  On 14.05.2016 at 04:04, John Hardin wrote:
>>> > >   How would a webservice be better? That would still be sending
>>> > >   customer emails to a third party for processing.
>>> >
>>> >  uhm you missed "and only give out the rules which hit and the
>>> >  spam/ham flag"
>>>
>>>  Ah, OK, I misunderstood what you were suggesting.
>>>
>>>  That wouldn't work. That tells you the rules they hit at the time they
>>>  were scanned, not which rules they would hit from the current testing
>>>  rules.
>>
>> on the other hand it would reflect the complete mail-flow and not just
>> hand-crafted samples
>
> It's not hand *crafted* samples, it's hand *classified* samples. The
> message needs to be classified by a reliable human as ham or spam for
> the analysis of the rules that it hits to have any use, or even be
> possible.

that's just nitpicking - I could correct you the same way in German for 
most of what you would try to express :-)

> That's why doing something like having an SA install that's based on the
> current SVN sandbox rules, and that gets a forked copy of your mail
> stream, and that captures the hits, is still not useful for anything
> other than gross "this rule didn't hit anything" analysis - you don't
> know what a given message *should* have been, so you can't say anything
> about the rules that hit it - whether they aid that result, or hinder it.

how do you imagine such a setup *in practice*?

> Unless your mail stream prior to SA is *guaranteed* 100% ham (which is
> hugely unlikely or why would you be running SA at all?) or 100% spam
> (which might be the case for a clean honeypot), you need to review and
> classify the messages manually before performing the scan and reporting
> the rule hits, and that means keeping copies of the pristine messages,
> at least for a while.
>
> I don't know whether statutory requirements make this impossible for you
> even if you did obtain consent from some of your clients to use their
> mail stream in that manner.

I don't have access to the whole mailflow to classify it, nor is there 
a technical way to mirror it on a different setup, nor would SA or even 
smtpd ever see 95% of the junk, because content filters are the last 
resort by definition

>> it should be constrained by a minimum negative score to count as ham
>> and a minimum positive score to count as spam - configurable, because
>> which scores are clear classifications depends on the local environment
>> and adjustments; 7.0 would not be 100% spam here, but 12.0 would be,
>> for example
>
> That's probably still not reliable enough for use in masscheck. Ham is a
> bit more important; what would you recommend as a lower limit for
> considering a message as ham? How many actual hams would meet that
> requirement? It might be a lot of work for little final benefit. What
> percentage actual FNs would you see with that setting? Those would
> damage the masscheck analysis.

I would agree if we could call the current masscheck results reliable

>> it would at least help in the current situation, with a rule like
>> FSL_HELO_HOME that hits only clear ham yet has a high spam score; if it
>> only needed to be enabled, collected the information through scanning,
>> and submitted the results once per day, a lot of people running
>> milter-like setups with reject and no access to rejected mails could
>> help improve the auto-QA without collecting whole mails
>
> Potentially. You'd have to be willing to set up a parallel mail
> processing stream using the current SVN sandbox rules as I described
> above. Performing analysis on the released rules provides no benefit to
> masscheck

why would it provide no benefit, when one part of "sa-update" - which 
currently doesn't deliver any updates most of the time - is to re-score 
badly scored rules? that's really not only about sandbox rules



Re: FSL_HELO_HOME: deep headers again

Posted by John Hardin <jh...@impsec.org>.
On Sat, 14 May 2016, Reindl Harald wrote:

> On 14.05.2016 at 04:50, John Hardin wrote:
>>  On Sat, 14 May 2016, Reindl Harald wrote:
>> >  On 14.05.2016 at 04:04, John Hardin wrote:
>> > >   How would a webservice be better? That would still be sending
>> > >   customer emails to a third party for processing.
>> > 
>> >  uhm you missed "and only give out the rules which hit and the
>> >  spam/ham flag"
>>
>>  Ah, OK, I misunderstood what you were suggesting.
>>
>>  That wouldn't work. That tells you the rules they hit at the time they
>>  were scanned, not which rules they would hit from the current testing
>>  rules.
>
> on the other hand it would reflect the complete mail-flow and not just 
> hand-crafted samples

It's not hand *crafted* samples, it's hand *classified* samples. The 
message needs to be classified by a reliable human as ham or spam for the 
analysis of the rules that it hits to have any use, or even be possible.

That's why doing something like having an SA install that's based on the 
current SVN sandbox rules, and that gets a forked copy of your mail 
stream, and that captures the hits, is still not useful for anything other 
than gross "this rule didn't hit anything" analysis - you don't know what 
a given message *should* have been, so you can't say anything about the 
rules that hit it - whether they aid that result, or hinder it.

Unless your mail stream prior to SA is *guaranteed* 100% ham (which is 
hugely unlikely or why would you be running SA at all?) or 100% spam 
(which might be the case for a clean honeypot), you need to review and 
classify the messages manually before performing the scan and reporting 
the rule hits, and that means keeping copies of the pristine messages, at 
least for a while.

I don't know whether statutory requirements make this impossible for you 
even if you did obtain consent from some of your clients to use their mail 
stream in that manner.

> it should be constrained by a minimum negative score to count as ham
> and a minimum positive score to count as spam - configurable, because
> which scores are clear classifications depends on the local environment
> and adjustments; 7.0 would not be 100% spam here, but 12.0 would be,
> for example

That's probably still not reliable enough for use in masscheck. Ham is a 
bit more important; what would you recommend as a lower limit for 
considering a message as ham? How many actual hams would meet that 
requirement? It might be a lot of work for little final benefit. What 
percentage actual FNs would you see with that setting? Those would damage 
the masscheck analysis.

> it would at least help in the current situation, with a rule like
> FSL_HELO_HOME that hits only clear ham yet has a high spam score; if it
> only needed to be enabled, collected the information through scanning,
> and submitted the results once per day, a lot of people running
> milter-like setups with reject and no access to rejected mails could
> help improve the auto-QA without collecting whole mails

Potentially. You'd have to be willing to set up a parallel mail processing 
stream using the current SVN sandbox rules as I described above. 
Performing analysis on the released rules provides no benefit to 
masscheck.

>> > >   Corpora with headers stripped do present a problem. The masscheck
>> > >   corpora should be complete as received
>> > 
>> >  and that is not possible - samples are stripped and anonymized

-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
   Maxim IX: Never turn your back on an enemy.
-----------------------------------------------------------------------
  144 days since the first successful real return to launch site (SpaceX)

Re: FSL_HELO_HOME: deep headers again

Posted by Reindl Harald <h....@thelounge.net>.

On 14.05.2016 at 04:50, John Hardin wrote:
> On Sat, 14 May 2016, Reindl Harald wrote:
>> On 14.05.2016 at 04:04, John Hardin wrote:
>>>  How would a webservice be better? That would still be sending customer
>>>  emails to a third party for processing.
>>
>> uhm you missed "and only give out the rules which hit and the spam/ham
>> flag"
>
> Ah, OK, I misunderstood what you were suggesting.
>
> That wouldn't work. That tells you the rules they hit at the time they
> were scanned, not which rules they would hit from the current testing
> rules.

on the other hand it would reflect the complete mail-flow and not just 
hand-crafted samples

it should be constrained by a minimum negative score to count as ham 
and a minimum positive score to count as spam - configurable, because 
which scores are clear classifications depends on the local environment 
and adjustments; 7.0 would not be 100% spam here, but 12.0 would be, 
for example
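As a concrete sketch of that threshold idea (a hypothetical script; the 
thresholds are site-specific as noted above, and spamc -c really does 
print "score/required" on stdout):

  #!/bin/sh
  # record a message only when its score is unambiguous;
  # anything in between is discarded as unusable
  HAM_MAX="-1.0"    # at or below this, log as ham
  SPAM_MIN="12.0"   # at or above this, log as spam
  score=$(spamc -c < "$1" | cut -d/ -f1)
  if [ "$(echo "$score <= $HAM_MAX" | bc -l)" = 1 ]; then
      echo "ham $1"  >> hits.log
  elif [ "$(echo "$score >= $SPAM_MIN" | bc -l)" = 1 ]; then
      echo "spam $1" >> hits.log
  fi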

it would at least help in the current situation, with a rule like 
FSL_HELO_HOME that hits only clear ham yet has a high spam score; if it 
only needed to be enabled, collected the information through scanning, 
and submitted the results once per day, a lot of people running 
milter-like setups with reject and no access to rejected mails could 
help improve the auto-QA without collecting whole mails

>>>  Corpora with headers stripped do present a problem. The masscheck
>>>  corpora should be complete as received
>>
>> and that is not possible - samples are stripped and anonymized


Re: FSL_HELO_HOME: deep headers again

Posted by John Hardin <jh...@impsec.org>.
On Sat, 14 May 2016, Reindl Harald wrote:

>
>
> On 14.05.2016 at 04:04, John Hardin wrote:
>>  On Fri, 13 May 2016, Reindl Harald wrote:
>> >  I can't rsync customer mails to a 3rd party
>>
>>  You don't have to. You run the masscheck locally and only upload the
>>  rule hit results. I upload my corpora because they are just my email and
>>  are thus tiny.
>>
>>  If you select your corpora filenames properly, no information should leak.
>
> OK
>
>> >  if it were based on some webservice where you just feed in local
>> >  samples and only give out the rules which hit and the spam/ham flag,
>> >  it would somehow be possible
>>
>>  How would a webservice be better? That would still be sending customer
>>  emails to a third party for processing.
>
> uhm you missed "and only give out the rules which hit and the spam/ham flag"

Ah, OK, I misunderstood what you were suggesting.

That wouldn't work. That tells you the rules they hit at the time they 
were scanned, not which rules they would hit from the current testing 
rules.

>> >  especially, you would not get much from the bayes samples because
>> >  they would trigger all sorts of wrong rules after stripping most
>> >  headers and adding a generic Received header (which the bayes engine
>> >  seems to need for whatever reason, since it otherwise scores samples
>> >  completely differently)
>>
>>  Corpora with headers stripped do present a problem. The masscheck
>>  corpora should be complete as received
>
> and that is not possible - samples are stripped and anonymized

-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
   Justice is justice, whereas "social justice" is code for one set
   of rules for the rich, another for the poor; one set for whites,
   another set for minorities; one set for straight men, another for
   women and gays. In short, it's the opposite of actual justice.
                                                     -- Burt Prelutsky
-----------------------------------------------------------------------
  143 days since the first successful real return to launch site (SpaceX)

Re: FSL_HELO_HOME: deep headers again

Posted by Reindl Harald <h....@thelounge.net>.

On 14.05.2016 at 04:04, John Hardin wrote:
> On Fri, 13 May 2016, Reindl Harald wrote:
>> I can't rsync customer mails to a 3rd party
>
> You don't have to. You run the masscheck locally and only upload the
> rule hit results. I upload my corpora because they are just my email and
> are thus tiny.
>
> If you select your corpora filenames properly, no information should leak.

OK

>> if it were based on some webservice where you just feed in local
>> samples and only give out the rules which hit and the spam/ham flag,
>> it would somehow be possible
>
> How would a webservice be better? That would still be sending customer
> emails to a third party for processing.

uhm you missed "and only give out the rules which hit and the spam/ham flag"

>> especially, you would not get much from the bayes samples because they
>> would trigger all sorts of wrong rules after stripping most headers and
>> adding a generic Received header (which the bayes engine seems to need
>> for whatever reason, since it otherwise scores samples completely
>> differently)
>
> Corpora with headers stripped do present a problem. The masscheck
> corpora should be complete as received

and that is not possible - samples are stripped and anonymized


Re: FSL_HELO_HOME: deep headers again

Posted by John Hardin <jh...@impsec.org>.
On Fri, 13 May 2016, Reindl Harald wrote:

>
> On 13.05.2016 at 18:11, John Hardin wrote:
>>  On Fri, 13 May 2016, Reindl Harald wrote:
>> 
>> >  the problem is pushing out such rules with such scores at all,
>> >  combined with a non-working auto-QA (non-working as in: no correction
>> >  for days, as well as dangerous scoring of new rules from the start)
>> > 
>> >  02-Mai-2016 00:12:34: SpamAssassin: No update available
>> >  03-Mai-2016 01:55:05: SpamAssassin: No update available
>> >  04-Mai-2016 00:43:33: SpamAssassin: No update available
>> >  05-Mai-2016 01:48:15: SpamAssassin: Update processed successfully
>> >  06-Mai-2016 00:53:17: SpamAssassin: No update available
>> >  07-Mai-2016 01:21:23: SpamAssassin: No update available
>> >  08-Mai-2016 01:38:23: SpamAssassin: No update available
>> >  09-Mai-2016 00:02:56: SpamAssassin: No update available
>> >  10-Mai-2016 01:10:29: SpamAssassin: No update available
>> >  11-Mai-2016 00:55:46: SpamAssassin: No update available
>> >  12-Mai-2016 00:21:17: SpamAssassin: Update processed successfully
>> >  13-Mai-2016 00:33:31: SpamAssassin: No update available
>>
>>  Perhaps you could help with that by participating in masscheck. You seem
>>  to get a lot of FPs on base rules; contributing masscheck results on
>>  your ham would reduce those
>
> I can't rsync customer mails to a 3rd party

You don't have to. You run the masscheck locally and only upload the rule 
hit results. I upload my corpora because they are just my email and are 
thus tiny.

If you select your corpora filenames properly, no information should leak.

> if it were based on some webservice where you just feed in local
> samples and only give out the rules which hit and the spam/ham flag,
> it would somehow be possible

How would a webservice be better? That would still be sending customer 
emails to a third party for processing.

> especially, you would not get much from the bayes samples because they
> would trigger all sorts of wrong rules after stripping most headers and
> adding a generic Received header (which the bayes engine seems to need
> for whatever reason, since it otherwise scores samples completely
> differently)

Corpora with headers stripped do present a problem. The masscheck 
corpora should be complete as received.

-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
   People think they're trading chaos for order [by ceding more and
   more power to the Government], but they're just trading normal
   human evil for the really dangerous organized kind of evil, the
   kind that simply does not give a shit. Only bureaucrats can give
   you true evil.                                     -- Larry Correia
-----------------------------------------------------------------------
  143 days since the first successful real return to launch site (SpaceX)

Re: FSL_HELO_HOME: deep headers again

Posted by Reindl Harald <h....@thelounge.net>.
On 13.05.2016 at 18:11, John Hardin wrote:
> On Fri, 13 May 2016, Reindl Harald wrote:
>
>> the problem is pushing out such rules with such scores at all,
>> combined with a non-working auto-QA (non-working as in: no correction
>> for days, as well as dangerous scoring of new rules from the start)
>>
>> 02-Mai-2016 00:12:34: SpamAssassin: No update available
>> 03-Mai-2016 01:55:05: SpamAssassin: No update available
>> 04-Mai-2016 00:43:33: SpamAssassin: No update available
>> 05-Mai-2016 01:48:15: SpamAssassin: Update processed successfully
>> 06-Mai-2016 00:53:17: SpamAssassin: No update available
>> 07-Mai-2016 01:21:23: SpamAssassin: No update available
>> 08-Mai-2016 01:38:23: SpamAssassin: No update available
>> 09-Mai-2016 00:02:56: SpamAssassin: No update available
>> 10-Mai-2016 01:10:29: SpamAssassin: No update available
>> 11-Mai-2016 00:55:46: SpamAssassin: No update available
>> 12-Mai-2016 00:21:17: SpamAssassin: Update processed successfully
>> 13-Mai-2016 00:33:31: SpamAssassin: No update available
>
> Perhaps you could help with that by participating in masscheck. You seem
> to get a lot of FPs on base rules; contributing masscheck results on
> your ham would reduce those

I can't rsync customer mails to a 3rd party

if it were based on some webservice where you just feed in local 
samples and only give out the rules which hit and the spam/ham flag, it 
would somehow be possible

especially, you would not get much from the bayes samples because they 
would trigger all sorts of wrong rules after stripping most headers and 
adding a generic Received header (which the bayes engine seems to need 
for whatever reason, since it otherwise scores samples completely 
differently)

in any case: such a rule with 3.7 must not happen at all, even if it 
had no such bad impact - 3.7 is very high and only deserved when you 
are certain that a mail is spam, and that certainty is *never* backed 
by a single header, deep inspection or not


Re: FSL_HELO_HOME: deep headers again

Posted by John Hardin <jh...@impsec.org>.
On Fri, 13 May 2016, Reindl Harald wrote:

> the problem is pushing out such rules with such scores at all, combined
> with a non-working auto-QA (non-working as in: no correction for days,
> as well as dangerous scoring of new rules from the start)
>
> 02-Mai-2016 00:12:34: SpamAssassin: No update available
> 03-Mai-2016 01:55:05: SpamAssassin: No update available
> 04-Mai-2016 00:43:33: SpamAssassin: No update available
> 05-Mai-2016 01:48:15: SpamAssassin: Update processed successfully
> 06-Mai-2016 00:53:17: SpamAssassin: No update available
> 07-Mai-2016 01:21:23: SpamAssassin: No update available
> 08-Mai-2016 01:38:23: SpamAssassin: No update available
> 09-Mai-2016 00:02:56: SpamAssassin: No update available
> 10-Mai-2016 01:10:29: SpamAssassin: No update available
> 11-Mai-2016 00:55:46: SpamAssassin: No update available
> 12-Mai-2016 00:21:17: SpamAssassin: Update processed successfully
> 13-Mai-2016 00:33:31: SpamAssassin: No update available

Perhaps you could help with that by participating in masscheck. You seem 
to get a lot of FPs on base rules; contributing masscheck results on your 
ham would reduce those.

-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
   ...much of our country's counterterrorism security spending is not
   designed to protect us from the terrorists, but instead to protect
   our public officials from criticism when another attack occurs.
                                                     -- Bruce Schneier
-----------------------------------------------------------------------
  143 days since the first successful real return to launch site (SpaceX)

Re: FSL_HELO_HOME: deep headers again

Posted by Reindl Harald <h....@thelounge.net>.

On 13.05.2016 at 16:25, RW wrote:
> On Fri, 13 May 2016 15:42:07 +0200
> Reindl Harald wrote:
>
>> WTF - Received: from daves-air.home ([1.125.7.92]) is once again a
>> DEEP HEADER inspection -
>
> This looks like a simple mistake rather than a deliberate attempt at a
> deep check. You should file a bug report

the problem is pushing out such rules with such scores at all, combined 
with a non-working auto-QA (non-working as in: no correction for days, 
as well as dangerous scoring of new rules from the start)

02-Mai-2016 00:12:34: SpamAssassin: No update available
03-Mai-2016 01:55:05: SpamAssassin: No update available
04-Mai-2016 00:43:33: SpamAssassin: No update available
05-Mai-2016 01:48:15: SpamAssassin: Update processed successfully
06-Mai-2016 00:53:17: SpamAssassin: No update available
07-Mai-2016 01:21:23: SpamAssassin: No update available
08-Mai-2016 01:38:23: SpamAssassin: No update available
09-Mai-2016 00:02:56: SpamAssassin: No update available
10-Mai-2016 01:10:29: SpamAssassin: No update available
11-Mai-2016 00:55:46: SpamAssassin: No update available
12-Mai-2016 00:21:17: SpamAssassin: Update processed successfully
13-Mai-2016 00:33:31: SpamAssassin: No update available


Re: FSL_HELO_HOME: deep headers again

Posted by RW <rw...@googlemail.com>.
On Fri, 13 May 2016 15:42:07 +0200
Reindl Harald wrote:

> WTF - Received: from daves-air.home ([1.125.7.92]) is once again a 
> DEEP HEADER inspection -

This looks like a simple mistake rather than a deliberate attempt at a
deep check. You should file a bug report.