Posted to dev@spamassassin.apache.org by "Kevin A. McGrail" <km...@apache.org> on 2019/05/31 00:35:20 UTC

__E_LIKE_LETTER & __LOWER_E filling subtests debug

I was curious if anyone noticed the debug output for subtests has gotten
insane:

May 30 20:29:35.492 [8403] dbg: check:
subtests=__AM_DYING,__ANY_TEXT_ATTACH,__ANY_TEXT_ATTACH_DOC,__BODY_TEXT_LINE,__BODY_TEXT_LINE,__BODY_TEXT_LINE,__CT,__CTYPE_HAS_BOUNDARY,__CTYPE_MULTIPART_ANY,__CTYPE_MULTIPART_MIXED,__DKIMWL_FREEMAIL,__DKIMWL_WL_MED,__DKIM_DEPENDABLE,__DKIM_EXISTS,__DOS_BODY_WED,__DOS_HAS_ANY_URI,__DOS_HAS_LIST_ID,__DOS_HAS_LIST_UNSUB,__DOS_HAS_MAILING_LIST,__DOS_LINK,__DOS_RCVD_MON,__DOS_REF_2_WK_DAYS,__DOS_RELAYED_EXT,__DOS_SINGLE_EXT_RELAY,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,_
_E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,_
_E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__FB_NUM_PERCNT,__FB_TOUR,__FILL_THIS_FORM_PARTIAL_RAW,__FRAUD_DBI,__FRAUD_PTS,__FRAUD_PTX,__FSL_HAS_LIST_UNSUB,__FSL_RELAY_GOOGLE,__HAS_ANY_EMAIL,__HAS_ANY_URI,__HAS_DATE,__HAS_DKIM_SIGHD,__HAS_FROM,__HAS_HREF,__HAS_IMG_SRC,__HAS_LIST_ID,__HAS_MESSAGE_ID,__HAS_MSGID,__HAS_RCVD,__HAS_SENDER,__HAS_SUBJECT,__HAS_TO,__HAS_URI,__HAS_X_BEEN_THERE,__HAVE_BOUNCE_RELAYS,__HIGHBITS,__HUSH_HUSH,__JMQ_PICKUP4,__KAM_ADVERTISE1,__KAM_ANDROGEL3,__KAM_AUTO2,__
KAM_AUTO3,__KAM_BACK3,__KAM_BLOOD3,__KAM_BLOOD6,__KAM_CARD5,__KAM_CEP6,__KAM_COUNT_URIS,__KAM_COUNT_URIS,__KAM_COUNT_URIS,__KAM_COUNT_URIS,__KAM_COUNT_URIS,__KAM_COUNT_URIS,__KAM_COUNT_URIS,__KAM_COUNT_URIS,__KAM_COUNT_URIS,__KAM_COUNT_URIS,__KAM_COUNT_URIS,__KAM_COUNT_URIS,__KAM_COUNT_URIS,__KAM_COUNT_URIS,__KAM_COUNT_URIS,__KAM_COUNT_URIS,__KAM_CRIM4,__KAM_DROPBOX2,__KAM_FAKEDELIVER12,__KAM_FAKEDELIVER4,__KAM_FAKEDELIVER6,__KAM_FAKEDELIVER8,__KAM_GENERICHEALTH3,__KAM_GIFT2,__KAM_GOOGLE2_2,__KAM_HARP3,__KAM_HAS_10_URIS,__KAM_HAS_15_URIS,__KAM_HAS_1_URIS,__KAM_HAS_2_URIS,__KAM_HAS_3_URIS,__KAM_HAS_4_URIS,__KAM_HAS_5_URIS,__KAM_HUGEIMGSRC,__KAM_INSURE4,__KAM_JESUS1,__KAM_JURY3,__KAM_LIST4,__KAM_LOTTO3,__KAM_MAILSPLOIT2,__KAM_MARIJUANA3,__KAM_MED2,__KAM_MULTIPLE_FROM,__KAM_NIGERIAN2,__KAM_NIGERIAN2_3,__KAM_NIGERIAN2_7,__KAM_NIGERIAN3,__KAM_NIGERIAN4,__KAM_NUMSUBJECT,__KAM_OBF2,__KAM_OPRAH2,__KAM_OZ3,__KAM_PATRIOT2,__KAM_PATRIOT3,__KAM_PAYPAL3B,__KAM_POLITICS2,__KAM_PROPHET1,__KAM_PROPHET2,__KAM_PROPHET4,__KAM_PROPHET5,__KAM_REFI2,__KAM_REHAB3,__KAM_RELIGION3,__KAM_RPTR_PASSED,__KAM_SEO7,__KAM_SHARKPROD,__KAM_SUBJECTYEAR,__KAM_TAX4,__KAM_TIME4,__KAM_TRUMPCOIN1,__KAM_TVDOCTOR3,__KAM_UNIV1B,__KAM_UPS2,__KAM_VOICEMAIL1,__KAM_WU1,__KAM_ZWNJ2,__KAM_ZWNJ2,__KAM_ZWNJ2,__KAM_ZWNJ2,__KAM_ZWNJ2,__KAM_ZWNJ2,__KAM_ZWNJ2,__KAM_ZWNJ2,__KAM_ZWNJ2,__KAM_ZWNJ2,__KAM_ZWNJ2,__KAM_ZWNJ2,__KAM_ZWNJ2,__KAM_ZWNJ2,__KAM_ZWNJ2,__KAM_ZWNJ2,__LAST_EXTERNAL_RELAY_NO_AUTH,__LAST_UNTRUSTED_RELAY_NO_AUTH,__LOCAL_PP_NONPPURL,__LONGLINE,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__L
OWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__L_BODY_8BITS,__MIME_HTML,__MIME_VERSION,__MISSING_REF,__MISSING_REPLY,__ML1,__ML2,__ML3,__ML4,__MOZILLA_MSGID,__MSGID_GUID,__MSGID_OK_HEX,__MSGID_OK_HOST,__MSOE_MID_WRONG_
CASE,__NONEMPTY_BODY,__NOT_A_PERSON,__NOT_SPOOFED,__NUMBERS_IN_SUBJ,__RB_GT_200,__RCD_RDNS_MAIL,__RCD_RDNS_MAIL_MESSY,__RCVD_IN_DNSWL,__RCVD_IN_HOSTKARMA,__RCVD_IN_MSPIKE_L,__RCVD_IN_SENDERSCORE_90_100,__RCVD_IN_SORBS,__RESIGNER2,__SANE_MSGID,__SENDER_BOT,__SPOOFED_URL,__SUBJ_NOT_SHORT,__SUBSCRIPTION_INFO,__TOCC_EXISTS,__TO_EQ_FROM_USR_1,__TO_EQ_FROM_USR_NN_1,__TVD_MIME_ATT_TP,__UNSUB_EMAIL,__UNSUB_LINK,__URI_GOOGLE_PROXY,__URI_MAILTO,__URI_MAILTO,__VIA_ML,__VIA_RESIGNER,__freemail_safe


72_active.cf:    body            __LOWER_E       /e/
72_active.cf:    tflags          __LOWER_E       multiple maxhits=230

72_active.cf:    body            __E_LIKE_LETTER /<lcase_e>/
72_active.cf:    tflags          __E_LIKE_LETTER multiple maxhits=320

Assuming those maxhits are correct, maybe we need something in the debug
output that says __E_LIKE_LETTER (number of hits if more than 1).
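
As a rough illustration of that idea outside SpamAssassin itself, a shell filter can collapse the repeats after the fact. This is only a sketch: collapse_subtests is a hypothetical helper name, and it assumes the subtests list has already been cut out of the debug line as one comma-separated string.

```shell
# Hypothetical helper (not part of SpamAssassin): reads a comma-separated
# list of rule names on stdin and prints each name once, followed by
# "(N)" when it occurred more than once. First-seen order is kept.
collapse_subtests() {
  tr ',' '\n' | awk '
    NF {
      if (!($0 in count)) order[++n] = $0   # remember first appearance
      count[$0]++
    }
    END {
      out = ""
      for (i = 1; i <= n; i++) {
        name = order[i]
        if (i > 1) out = out ","
        out = out name
        if (count[name] > 1) out = out "(" count[name] ")"
      }
      print out
    }'
}

# Example with made-up rule names:
printf '__A,__B,__B,__B,__C\n' | collapse_subtests   # prints __A,__B(3),__C
```

A name that appears once is printed bare; a name that repeats gets its count in parentheses.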

Regards,

KAM

-- 
Kevin A. McGrail
Member, Apache Software Foundation
Chair Emeritus Apache SpamAssassin Project
https://www.linkedin.com/in/kmcgrail - 703.798.0171



Re: __E_LIKE_LETTER & __LOWER_E filling subtests debug

Posted by John Hardin <jh...@impsec.org>.
On Fri, 31 May 2019, Bill Cole wrote:

> On 31 May 2019, at 7:46, Kevin A. McGrail wrote:
>
>> Well there might be 6 rules like this but testing emails for lower case e
>> hits the maxhits on a lot of emails.
>
> If anyone has insight into how I might measure a character occurrence ratio 
> in messages less noisily, I'm eager to be enlightened.

"less noisily" would probably require a plugin.
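
Outside of rule matching, the ratio itself is simple arithmetic. A minimal sketch, assuming the message text is available on stdin and that a plain shell function is acceptable for experimenting. char_ratio is a hypothetical name; this is not a SpamAssassin plugin and says nothing about how a plugin would expose the number to rules.

```shell
# Hypothetical helper: prints occurrences of one literal letter divided by
# the total number of characters read (newlines excluded), to 3 decimals.
# Assumption: $1 is a plain letter; it is used as an awk regex, so regex
# metacharacters would need escaping.
char_ratio() {
  awk -v ch="$1" '
    {
      total += length($0)
      line = $0
      hits += gsub(ch, "", line)   # gsub returns the number of matches
    }
    END {
      if (total > 0) printf "%.3f\n", hits / total
      else print "0.000"
    }'
}

# Example: 3 of 6 characters are "e"
printf 'eee\nabc\n' | char_ratio e   # prints 0.500
```

One pass yields a single number per message instead of one hit per matched character.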

-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
   The Tea Party wants to remove the Crony from Crony Capitalism.
   OWS wants to remove Capitalism from Crony Capitalism.
                                                     -- Astaghfirullah
-----------------------------------------------------------------------
  6 days until the 75th anniversary of D-Day

Re: __E_LIKE_LETTER & __LOWER_E filling subtests debug

Posted by Bill Cole <sa...@billmail.scconsult.com>.
On 31 May 2019, at 7:46, Kevin A. McGrail wrote:

> Well there might be 6 rules like this but testing emails for lower case e
> hits the maxhits on a lot of emails.

If anyone has insight into how I might measure a character occurrence 
ratio in messages less noisily, I'm eager to be enlightened.

-- 
Bill Cole
bill@scconsult.com or billcole@apache.org
(AKA @grumpybozo and many *@billmail.scconsult.com addresses)
Not Currently Available For Hire

Re: __E_LIKE_LETTER & __LOWER_E filling subtests debug

Posted by "Kevin A. McGrail" <km...@apache.org>.
Well there might be 6 rules like this but testing emails for lower case e
hits the maxhits on a lot of emails.  It has made debugging rules onerous
because subtests debug gets loooong.

I will look at some code to collapse it and come back to the list.

On Fri, May 31, 2019, 06:50 Karsten Bräckelmann <gu...@rudersport.de>
wrote:

> On Thu, 2019-05-30 at 20:35 -0400, Kevin A. McGrail wrote:
> > I was curious if anyone noticed the debug output for subtests has gotten
> > insane:
> >
> > May 30 20:29:35.492 [8403] dbg: check:
> > subtests=__AM_DYING,__ANY_TEXT_ATTACH,__ANY_TEXT_ATTACH_DOC,__BODY_TE
> > XT_LINE,__BODY_TEXT_LINE,__BODY_TEXT_LINE,__CT,__CTYPE_HAS_BOUNDARY,_
> > _CTYPE_MULTIPART_ANY,__CTYPE_MULTIPART_MIXED,__DKIMWL_FREEMAIL,__DKIM
> > WL_WL_MED,__DKIM_DEPENDABLE,__DKIM_EXISTS,__DOS_BODY_WED,__DOS_HAS_AN
> > Y_URI,__DOS_HAS_LIST_ID,__DOS_HAS_LIST_UNSUB,__DOS_HAS_MAILING_LIST,_
> > _DOS_LINK,__DOS_RCVD_MON,__DOS_REF_2_WK_DAYS,__DOS_RELAYED_EXT,__DOS_
> > SINGLE_EXT_RELAY,__E_LIKE_LETTER,__E_LIKE_LETTER, [...]
>
> Shortened that output of the _SUBTESTS_ Template Tag. Suffice to say it
> showed both rules __LOWER_E and __E_LIKE_LETTER their respective
> maxhits times.
>
>
> > 72_active.cf:    body            __LOWER_E       /e/
> > 72_active.cf:    tflags          __LOWER_E       multiple maxhits=230
> >
> > 72_active.cf:    body            __E_LIKE_LETTER /<lcase_e>/
> > 72_active.cf:    tflags          __E_LIKE_LETTER multiple maxhits=320
> >
> > Assuming those maxhits are correct,
>
> There are a total of 6 rules with maxhits >= 100 (or > 26 in case you
> prefer), all sub-rules. Also, all living in Bill's sandbox...
>
>
> > maybe we need something in the debug
> > output that says __E_LIKE_LETTER (number of hits if more than 1).
>
> With maxhits of 20+ I'd be in favor of such an abbreviation.
>
> Applied in any case of more than a single hit, I don't mean to collapse
> in cases exceeding a threshold. I just mean it's not worth bothering if
> there are only instances of about 10 at most...
>
> Do note though that it is the same with regular (non-sub) rules: With
> tflags multiple they are listed up to maxhits times in both header
> style _TESTS_ and _REPORT_ as well as body style _SUMMARY_ Template
> Tags.
>
>
> Reminds me of the multiple and maxhits long pre-dating Rules Emporium
> backhair and friends rule-sets with a similarly ridiculous tests hit
> report pattern due to basically implementing these counting features in
> pure meta...
>
>
> --
> char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
> main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
> (c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}
>
>

Re: __E_LIKE_LETTER & __LOWER_E filling subtests debug

Posted by Karsten Bräckelmann <gu...@rudersport.de>.
On Thu, 2019-05-30 at 20:35 -0400, Kevin A. McGrail wrote:
> I was curious if anyone noticed the debug output for subtests has gotten
> insane:
> 
> May 30 20:29:35.492 [8403] dbg: check:
> subtests=__AM_DYING,__ANY_TEXT_ATTACH,__ANY_TEXT_ATTACH_DOC,__BODY_TE
> XT_LINE,__BODY_TEXT_LINE,__BODY_TEXT_LINE,__CT,__CTYPE_HAS_BOUNDARY,_
> _CTYPE_MULTIPART_ANY,__CTYPE_MULTIPART_MIXED,__DKIMWL_FREEMAIL,__DKIM
> WL_WL_MED,__DKIM_DEPENDABLE,__DKIM_EXISTS,__DOS_BODY_WED,__DOS_HAS_AN
> Y_URI,__DOS_HAS_LIST_ID,__DOS_HAS_LIST_UNSUB,__DOS_HAS_MAILING_LIST,_
> _DOS_LINK,__DOS_RCVD_MON,__DOS_REF_2_WK_DAYS,__DOS_RELAYED_EXT,__DOS_
> SINGLE_EXT_RELAY,__E_LIKE_LETTER,__E_LIKE_LETTER, [...]

Shortened that output of the _SUBTESTS_ Template Tag. Suffice to say it
listed both rules, __LOWER_E and __E_LIKE_LETTER, their respective
maxhits times.


> 72_active.cf:    body            __LOWER_E       /e/
> 72_active.cf:    tflags          __LOWER_E       multiple maxhits=230
> 
> 72_active.cf:    body            __E_LIKE_LETTER /<lcase_e>/
> 72_active.cf:    tflags          __E_LIKE_LETTER multiple maxhits=320
> 
> Assuming those maxhits are correct,

There are a total of 6 rules with maxhits >= 100 (or > 26 in case you
prefer), all sub-rules. Also, all living in Bill's sandbox...


> maybe we need something in the debug
> output that says __E_LIKE_LETTER (number of hits if more than 1).

With maxhits of 20+ I'd be in favor of such an abbreviation.

That should apply in any case of more than a single hit; I don't mean to
collapse only in cases exceeding a threshold. I just mean it's not worth
bothering if rules only hit about 10 times at most...

Do note though that it is the same with regular (non-sub) rules: With
tflags multiple they are listed up to maxhits times in both header
style _TESTS_ and _REPORT_ as well as body style _SUMMARY_ Template
Tags.


Reminds me of the Rules Emporium backhair and friends rule-sets, which long
pre-date multiple and maxhits and had a similarly ridiculous tests hit
report pattern due to basically implementing these counting features in
pure meta...


-- 
char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}


Re: __E_LIKE_LETTER & __LOWER_E filling subtests debug

Posted by John Hardin <jh...@impsec.org>.
On Wed, 5 Jun 2019, Kevin A. McGrail wrote:

> Good point, Henrik & John.
>
> OK, I've left the output alone except for the calls from dbg so it
> shouldn't break anything in the public interface.
>
> Thoughts on this version?

Looks much safer, but I still wonder whether the repetitive hits in the 
normal output have any value. After all, masscheck only cares *whether* the 
rule hit, not *how many times* in a given message...

But I haven't performed an analysis of everything else that consumes that 
output.
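
To illustrate the kind of consumer at risk, here is a sketch of a caller that splits the list on commas and looks for exact rule names. count_names is a hypothetical helper, not one of the scripts named in this thread; the annotated input mimics the format produced by the first version of the patch.

```shell
# Hypothetical consumer: split a comma-separated list from stdin and count
# entries that are exactly equal to the rule name given as $1.
count_names() {
  tr ',' '\n' | grep -c -x -F -- "$1" || true   # grep exits 1 on zero matches
}

# Plain list: the repeated name is found twice.
printf '__A,__LOWER_E,__LOWER_E\n' | count_names __LOWER_E   # prints 2

# Annotated list: "__LOWER_E(2) (Total ...)" is no longer an exact name.
printf '__A,__LOWER_E(2) (Total Subtest Hits: 3 / Deduplicated Total Hits: 2)\n' \
  | count_names __LOWER_E                                     # prints 0
```

Any caller that treats each comma-separated field as a bare rule name would miss the rule once counts or a trailing summary are appended, which is the argument for keeping the annotation confined to debug logging.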

> Regards,
> KAM
>
> On 6/4/2019 1:51 PM, John Hardin wrote:
>> On Tue, 4 Jun 2019, Kevin A. McGrail wrote:
>>
>>> Yes, I was thinking about that and wanting to fix uritests as well for the
>>> template.   Thanks for the feedback.  I will take another pass at it.
>>
>> Just do the deduplication without modifying the output format.
>>
>> If we want to log the hit counts, then make another function that does
>> what you did and use it for logging.
>>
>>
>>> On Tue, Jun 4, 2019, 03:23 Henrik K <he...@hege.li> wrote:
>>>
>>>>
>>>> If you want to modify debug output, you have to modify only the dbg()
>>>> output itself.  You can't modify internal functions that have specific
>>>> output formats and start adding random strings to them.  At least these
>>>> places depend on the comma delimited rules:
>>>>
>>>> ./masses/mass-check:    push @tests, split(/,/, $status->get_names_of_subtests_hit());
>>>> ./t/rule_tests.t:    my %rules_hit = map { $_ => 1 } split(/,/,$msg->get_names_of_tests_hit()), split(/,/,$msg->get_names_of_subtests_hit());
>>>> ./t.rules/run:  my $testsline = $status->get_names_of_tests_hit().",".$status->get_names_of_subtests_hit();
>>>>
>>>>
>>>>
>>>>
>>>> On Tue, Jun 04, 2019 at 01:56:26AM -0400, Kevin A. McGrail wrote:
>>>>> Morning All,
>>>>>
>>>>> After a few thoughts on limits, it appears that any duplicate subtest
>>>>> hits are best combined for debug output.
>>>>>
>>>>> Any thoughts on the attached?  It looks like it will help me with rule
>>>>> development while supporting rules with valid but large maxhits like
>>>>> __LOWER_E
>>>>>
>>>>> Regards,
>>>>> KAM
>>>>>
>>>>> On 5/31/2019 10:30 AM, Bill Cole wrote:
>>>>>> On 30 May 2019, at 20:35, Kevin A. McGrail wrote:
>>>>>>
>>>>>>> I was curious if anyone noticed the debug output for subtests has gotten
>>>>>>> insane:
>>>>>>
>>>>>> It got a little discussion on users@ when I created those rules.
>>>>>>
>>>>>> [...]
>>>>>>
>>>>>>> 72_active.cf:    body            __LOWER_E       /e/
>>>>>>> 72_active.cf:    tflags          __LOWER_E       multiple maxhits=230
>>>>>>>
>>>>>>> 72_active.cf:    body            __E_LIKE_LETTER /<lcase_e>/
>>>>>>> 72_active.cf:    tflags          __E_LIKE_LETTER multiple maxhits=320
>>>>>>>
>>>>>>> Assuming those maxhits are correct,
>>>>>>
>>>>>> They are. In fact they were carefully tuned to catch the targeted
>>>>>> extortion spam.
>>>>>>
>>>>>>> maybe we need something in the debug
>>>>>>> output that says __E_LIKE_LETTER (number of hits if more than 1).
>>>>>>
>>>>>> That would be a useful enhancement even without my flagrant log
>>>>>> vandalism.
>>>>>>
>>>>>
>>>>> --
>>>>> Kevin A. McGrail
>>>>> Member, Apache Software Foundation
>>>>> Chair Emeritus Apache SpamAssassin Project
>>>>> https://www.linkedin.com/in/kmcgrail - 703.798.0171
>>>>>
>>>>
>>>>> Index: lib/Mail/SpamAssassin/PerMsgStatus.pm
>>>>> ===================================================================
>>>>> --- lib/Mail/SpamAssassin/PerMsgStatus.pm       (revision 1860582)
>>>>> +++ lib/Mail/SpamAssassin/PerMsgStatus.pm       (working copy)
>>>>> @@ -769,7 +769,38 @@
>>>>>  sub get_names_of_subtests_hit {
>>>>>    my ($self) = @_;
>>>>>
>>>>> -  return join(',', sort @{$self->{subtest_names_hit}});
>>>>> +  #return join(',', sort @{$self->{subtest_names_hit}});
>>>>> +
>>>>> +  #This routine prints only one instance of a subrule hit with a count of how many times it hit if greater than 1
>>>>> +  my (%subtest_names_hit, $i, $key, @keys, @sorted, $string, $rule, $total_hits, $deduplicated_hits);
>>>>> +
>>>>> +  $total_hits = scalar(@{$self->{subtest_names_hit}});
>>>>> +
>>>>> +  for ($i=0; $i < $total_hits; $i++) {
>>>>> +    $rule = ${$self->{subtest_names_hit}}[$i];
>>>>> +    $subtest_names_hit{$rule}++;
>>>>> +  }
>>>>> +
>>>>> +  foreach $key (keys %subtest_names_hit) {
>>>>> +    push (@keys, $key);
>>>>> +  }
>>>>> +  @sorted = sort @keys;
>>>>> +
>>>>> +  $deduplicated_hits = scalar(@sorted);
>>>>> +
>>>>> +  for ($i=0; $i < $deduplicated_hits; $i++) {
>>>>> +    $string .= $sorted[$i];
>>>>> +    if ($subtest_names_hit{$sorted[$i]} > 1) {
>>>>> +      $string .= "($subtest_names_hit{$sorted[$i]})"
>>>>> +    }
>>>>> +    $string .= ",";
>>>>> +  }
>>>>> +
>>>>> +  $string =~ s/,$//;
>>>>> +
>>>>> +  $string .= " (Total Subtest Hits: $total_hits / Deduplicated Total Hits: $deduplicated_hits)";
>>>>> +
>>>>> +  return $string;
>>>>>  }
>>>>>
>>>>>  ###########################################################################

-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
   Teach a man to fish, and he'll eat for life.
   Give him someone else's fish, and he'll vote for you.
-----------------------------------------------------------------------
  Tomorrow: the 75th anniversary of D-Day

Re: The obviously different case of subtest debug flood

Posted by Henrik K <he...@hege.li>.
On Fri, Jun 07, 2019 at 06:44:22PM +0300, Henrik K wrote:
> On Fri, Jun 07, 2019 at 11:39:12AM -0400, Kevin A. McGrail wrote:
> > On 6/7/2019 11:33 AM, Henrik K wrote:
> > > Well the information is there.  In many places.  You are saying you are
> > > consistently using things like spamassassin -t -D | grep __LOWER_E | wc -l
> > > to debug your rules?
> > 
> > Close.  I am consistently using spamassassin -t -D 2>&1 | grep -i -e
> > Content\ analysis -e KAM, for example and since I write mostly meta
> > rules, I get some really long and hard to read subtest debug lines.
> 
> Well the question was more for John.  I understand your previous problem,
> that's fine.
> 
> I don't understand how a line describing there are x more identical lines
> makes debugging any harder.  Just count 10+21, that's it.  It doesn't take
> more effort to read the whole spamassassin -D |less output manually than it
> does to create an exact grep | wc -l for the problem.

Ok let's take an example I can come up with...

spamassassin -t -D -L < testmsg 2>&1 |grep __LOWER_E
Jun  7 18:55:58.772 [8203] dbg: rules: ran body rule __LOWER_E ======> got hit: "e"
Jun  7 18:55:58.773 [8203] dbg: rules: ran body rule __LOWER_E ======> got hit: "e"
Jun  7 18:55:58.773 [8203] dbg: rules: ran body rule __LOWER_E ======> got hit: "e"
Jun  7 18:55:58.773 [8203] dbg: rules: ran body rule __LOWER_E ======> got hit: "e"
Jun  7 18:55:58.773 [8203] dbg: rules: ran body rule __LOWER_E ======> got hit: "e"
Jun  7 18:55:58.773 [8203] dbg: rules: ran body rule __LOWER_E ======> got hit: "e"
Jun  7 18:55:58.773 [8203] dbg: rules: ran body rule __LOWER_E ======> got hit: "e"
Jun  7 18:55:58.773 [8203] dbg: rules: ran body rule __LOWER_E ======> got hit: "e"
Jun  7 18:55:58.773 [8203] dbg: rules: ran body rule __LOWER_E ======> got hit: "e"
Jun  7 18:55:58.773 [8203] dbg: rules: ran body rule __LOWER_E ======> got hit: "e"
Jun  7 18:55:59.009 [8203] dbg: check: subtests=__BODY_LE_200,__BODY_TEXT_LINE(2),__DOS_RCVD_TUE,__ENV_AND_HDR_FROM_MATCH,__E_LIKE_LETTER(31),__HAS_DATE,__HAS_FROM,__HAS_MESSAGE_ID,__HAS_MSGID,__HAS_RCVD,__HAS_SUBJECT,__KAM_BODY_LENGTH_LT_1024,__KAM_BODY_LENGTH_LT_128,__KAM_BODY_LENGTH_LT_256,__KAM_BODY_LENGTH_LT_512,__KAM_DROPBOX2,__KAM_FAKEDELIVER12,__KAM_FAKEDELIVER4,__KAM_FAKEDELIVER6,__KAM_FAKEDELIVER8,__KAM_GOOGLE2_2,__KAM_HARP3,__KAM_HAS_0_URIS,__KAM_JURY3,__KAM_LOTSOFHASH,__KAM_MAILSPLOIT2,__KAM_MULTIPLE_FROM,__KAM_PAYPAL3B,__KAM_SUBJECT_SINGLEWORD,__KAM_UPS2,__KAM_WU1,__KHOP_NO_FULL_NAME,__LCL__ENV_AND_HDR_FROM_MATCH,__LCL__KAM_BODY_LENGTH_LT_1024,__LCL__KAM_BODY_LENGTH_LT_128,__LCL__KAM_BODY_LENGTH_LT_512,__LOWER_E(31),__MISSING_REF,__MISSING_REPLY,__MSGID_OK_HOST,__MSOE_MID_WRONG_CASE,__NONEMPTY_BODY,__NOT_SPOOFED,__RB_LE_200,__SANE_MSGID,__SINGLE_WORD_LINE,__SINGLE_WORD_SUBJ,__SUBJ_SHORT,__TO_NO_ARROWS_R (Total Subtest Hits: 110 / Deduplicated Total Hits: 49)

Perhaps you are expecting to see many more 'ran body rule' lines, which is a
bit of a questionable practice.  Do you really want to go browse 200 identical
lines?

Would a compatible solution be mentioning later duplicates in the same debug line so it can be seen in grep? (notice the last one)

Jun  7 18:55:58.772 [8203] dbg: rules: ran body rule __LOWER_E ======> got hit: "e"
Jun  7 18:55:58.773 [8203] dbg: rules: ran body rule __LOWER_E ======> got hit: "e"
Jun  7 18:55:58.773 [8203] dbg: rules: ran body rule __LOWER_E ======> got hit: "e"
Jun  7 18:55:58.773 [8203] dbg: rules: ran body rule __LOWER_E ======> got hit: "e"
Jun  7 18:55:58.773 [8203] dbg: rules: ran body rule __LOWER_E ======> got hit: "e"
Jun  7 18:55:58.773 [8203] dbg: rules: ran body rule __LOWER_E ======> got hit: "e"
Jun  7 18:55:58.773 [8203] dbg: rules: ran body rule __LOWER_E ======> got hit: "e"
Jun  7 18:55:58.773 [8203] dbg: rules: ran body rule __LOWER_E ======> got hit: "e"
Jun  7 18:55:58.773 [8203] dbg: rules: ran body rule __LOWER_E ======> got hit: "e"
Jun  7 18:55:58.773 [8203] dbg: rules: ran body rule __LOWER_E ======> got hit: "e" [... message repeated 21 more times]
Jun  7 18:55:59.009 [8203] dbg: check: subtests=__BODY_LE_200,__BODY_TEXT_LINE(2),__DOS_RCVD_TUE,__ENV_AND_HDR_FROM_MATCH,__E_LIKE_LETTER(31),__HAS_DATE,__HAS_FROM,__HAS_MESSAGE_ID,__HAS_MSGID,__HAS_RCVD,__HAS_SUBJECT,__KAM_BODY_LENGTH_LT_1024,__KAM_BODY_LENGTH_LT_128,__KAM_BODY_LENGTH_LT_256,__KAM_BODY_LENGTH_LT_512,__KAM_DROPBOX2,__KAM_FAKEDELIVER12,__KAM_FAKEDELIVER4,__KAM_FAKEDELIVER6,__KAM_FAKEDELIVER8,__KAM_GOOGLE2_2,__KAM_HARP3,__KAM_HAS_0_URIS,__KAM_JURY3,__KAM_LOTSOFHASH,__KAM_MAILSPLOIT2,__KAM_MULTIPLE_FROM,__KAM_PAYPAL3B,__KAM_SUBJECT_SINGLEWORD,__KAM_UPS2,__KAM_WU1,__KHOP_NO_FULL_NAME,__LCL__ENV_AND_HDR_FROM_MATCH,__LCL__KAM_BODY_LENGTH_LT_1024,__LCL__KAM_BODY_LENGTH_LT_128,__LCL__KAM_BODY_LENGTH_LT_512,__LOWER_E(31),__MISSING_REF,__MISSING_REPLY,__MSGID_OK_HOST,__MSOE_MID_WRONG_CASE,__NONEMPTY_BODY,__NOT_SPOOFED,__RB_LE_200,__SANE_MSGID,__SINGLE_WORD_LINE,__SINGLE_WORD_SUBJ,__SUBJ_SHORT,__TO_NO_ARROWS_R (Total Subtest Hits: 110 / Deduplicated Total Hits: 49)
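
For comparison, the syslog-style idea can be sketched as a stand-alone shell filter. This is a simplified variant and not the code under discussion: collapse_repeats is a hypothetical name, it annotates after the first line of a run rather than after the first ten, and it compares whole lines, so real debug output would need its timestamps stripped first because they differ between otherwise identical lines.

```shell
# Hypothetical filter: collapse runs of consecutive identical lines into
# one line plus a "[... message repeated N more times]" note.
collapse_repeats() {
  awk '
    function emit() {
      if (!have) return
      if (repeats > 0)
        print prev " [... message repeated " repeats " more times]"
      else
        print prev
    }
    have && $0 == prev { repeats++; next }      # same as previous line
    { emit(); prev = $0; repeats = 0; have = 1 } # start of a new run
    END { emit() }
  '
}

# Example with placeholder lines:
printf 'a\na\na\nb\n' | collapse_repeats
# prints:
#   a [... message repeated 2 more times]
#   b
```

Because only exact, adjacent duplicates are merged, lines that share a rule name but differ in the matched text stay separate.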


Re: The obviously different case of subtest debug flood

Posted by Henrik K <he...@hege.li>.
On Fri, Jun 07, 2019 at 11:39:12AM -0400, Kevin A. McGrail wrote:
> On 6/7/2019 11:33 AM, Henrik K wrote:
> > Well the information is there.  In many places.  You are saying you are
> > consistently using things like spamassassin -t -D | grep __LOWER_E | wc -l
> > to debug your rules?
> 
> Close.  I am consistently using spamassassin -t -D 2>&1 | grep -i -e
> Content\ analysis -e KAM, for example and since I write mostly meta
> rules, I get some really long and hard to read subtest debug lines.

Well the question was more for John.  I understand your previous problem,
that's fine.

I don't understand how a line describing there are x more identical lines
makes debugging any harder.  Just count 10+21, that's it.  It doesn't take
more effort to read the whole spamassassin -D |less output manually than it
does to create an exact grep | wc -l for the problem.


Re: The obviously different case of subtest debug flood

Posted by John Hardin <jh...@impsec.org>.
On Sat, 8 Jun 2019, Kevin A. McGrail wrote:

> On 6/8/2019 1:27 AM, Henrik K wrote:
>> On Fri, Jun 07, 2019 at 11:53:24AM -0700, John Hardin wrote:
>>> ...where we're not collapsing on solely the rule name, I'd accept that.
>> I guess this was the confusion then.  I've been talking about identical /
>> duplicate lines, as is the code.  Of course it won't collapse only on some
>> part of the line.  This same thing is a common default syslog feature, so I
>> thought it would be pretty clear.
>>
> Yes, I thought *immediately* of the syslog deduplication anti-flood
> feature and thought it was good.

Agreed. I was only stressing the point for clarity because of the 
different contexts.

-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
   Vista: because the audio experience is *far* more important than
   network throughput.
-----------------------------------------------------------------------
  2 days until the 52nd anniversary of Israel's victory in the Six-Day War

Re: The obviously different case of subtest debug flood

Posted by "Kevin A. McGrail" <km...@apache.org>.
On 6/8/2019 1:27 AM, Henrik K wrote:
> On Fri, Jun 07, 2019 at 11:53:24AM -0700, John Hardin wrote:
>> ...where we're not collapsing on solely the rule name, I'd accept that.
> I guess this was the confusion then.  I've been talking about identical /
> duplicate lines, as is the code.  Of course it won't collapse only on some
> part of the line.  This same thing is a common default syslog feature, so I
> thought it would be pretty clear.
>
Yes, I thought *immediately* of the syslog deduplication anti-flood
feature and thought it was good.

-- 
Kevin A. McGrail
Member, Apache Software Foundation
Chair Emeritus Apache SpamAssassin Project
https://www.linkedin.com/in/kmcgrail - 703.798.0171


Re: The obviously different case of subtest debug flood

Posted by Henrik K <he...@hege.li>.
On Fri, Jun 07, 2019 at 11:53:24AM -0700, John Hardin wrote:
> 
> ...where we're not collapsing on solely the rule name, I'd accept that.

I guess this was the confusion then.  I've been talking about identical /
duplicate lines, as is the code.  Of course it won't collapse only on some
part of the line.  This same thing is a common default syslog feature, so I
thought it would be pretty clear.
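
For illustration only, here is a minimal sketch of the syslog-style
collapsing described above: only consecutive, fully identical lines are
folded, never lines that merely share a rule name.  This is not the
committed SpamAssassin code; the helper name collapse_repeats and the
"last message repeated N times" wording are assumptions modeled on common
syslog behaviour.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of SpamAssassin): collapse runs of
# consecutive identical lines into the line plus a repeat notice.
sub collapse_repeats {
  my @lines = @_;
  my @out;
  my ($prev, $repeats) = (undef, 0);

  for my $line (@lines) {
    if (defined $prev && $line eq $prev) {
      $repeats++;            # exact duplicate of the previous line
      next;
    }
    # A different line ends the run; report how often the last one repeated.
    push @out, "last message repeated $repeats times" if $repeats;
    push @out, $line;
    ($prev, $repeats) = ($line, 0);
  }
  push @out, "last message repeated $repeats times" if $repeats;
  return @out;
}

my @log = (
  'ran body rule __LOWER_E ======> got hit: "e"',
  'ran body rule __LOWER_E ======> got hit: "e"',
  'ran body rule __LOWER_E ======> got hit: "e"',
  'ran body rule __LOWER_E ======> got hit: "E"',
);
print "$_\n" for collapse_repeats(@log);
```

The fourth line differs in the matched text, so it is kept as its own
entry even though the rule name is the same.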


Re: __E_LIKE_LETTER & __LOWER_E filling subtests debug

Posted by "Kevin A. McGrail" <km...@apache.org>.
That was in the original post, but to restate the issue: I use debug
output to write rules.  With the addition of a few rules such as
__E_LIKE_LETTER whose maxhits run to several hundred, trying to read which
subtests a message hits looks like this:

Jun  7 09:52:37.553 [23945] dbg: check:
subtests=__ANY_TEXT_ATTACH,__ANY_TEXT_ATTACH_DOC,__BODY_TEXT_LINE,__BODY_TEXT_LINE,__BODY_TEXT_LINE,__BUGGED_IMG,__CBJ_GiveMeABreak2,__CLICK_HERE,__CT,__CTYPE_HAS_BOUNDARY,__CTYPE_MULTIPART_ALT,__CTYPE_MULTIPART_ANY,__DEAL,__DKIMWL_WL_BL,__DKIM_DEPENDABLE,__DKIM_EXISTS,__DOS_HAS_ANY_URI,__DOS_HAS_LIST_UNSUB,__DOS_RCVD_WED,__DOS_RELAYED_EXT,__DOS_SINGLE_EXT_RELAY,__END_FUTURE_EMAILS,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER
,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER
,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__E_LIKE_LETTER,__FROM_FMBLA_NEWDOM,__FSL_HAS_LIST_UNSUB,__HAS_ANY_URI,__HAS_DATE,__HAS_DKIM_SIGHD,__HAS_DOMAINKEY_SIG,__HAS_FROM,__HAS_HREF,__HAS_MESSAGE_ID,__HAS_MSGID,__HAS_RCVD,__HAS_SUBJECT,__HAS_TO,__HAS_URI,__HAVE_BOUNCE_RELAYS,__HTML_LINK_IMAGE,__HUSH_HUSH,__KAM_COUNT_URIS,__KAM_COUNT_URIS,__KAM_COUNT_URIS,__KAM_COUNT_URIS,__KAM_COUNT_URIS,__KAM_COUNT_URIS,__KAM_COUNT_URIS,__KAM_COUNT_URIS,__KAM_DROPBOX2,__KAM_FAKEDELIVER12,__KAM_FAKEDELIVER4,__KAM_FAKEDELIVER6,__KAM_FAKEDELIVER8,__KAM_FUN1,__KAM_F
UN2,__KAM_FUN3,__KAM_FUN4,__KAM_GENERICHEALTH3,__KAM_GOOGLE2_2,__KAM_HARP3,__KAM_HAS_1_URIS,__KAM_HAS_2_URIS,__KAM_HAS_3_URIS,__KAM_HAS_4_URIS,__KAM_HAS_5_URIS,__KAM_HUGEIMGSRC,__KAM_JURY3,__KAM_LOTSOFHASH,__KAM_LOTTO3,__KAM_MAILSPLOIT2,__KAM_MULTIPLE_FROM,__KAM_PAYPAL3B,__KAM_RPTR_PASSED,__KAM_SEO7,__KAM_TIME4,__KAM_UPS2,__KAM_URIBL_PCCC,__KAM_WU1,__KB_WAM_FROM_NAME_SINGLEWORD,__LAST_EXTERNAL_RELAY_NO_AUTH,__LAST_UNTRUSTED_RELAY_NO_AUTH,__LIST_PARTIAL,__LOCAL_PP_NONPPURL,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__L
OWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__MIME_HTML,__MIME_VERSION,__MISSING_REF,__MISSING_REPLY,__MSGID_OK_HOST,__NONEMPTY_BODY,__NOT_A_PERSON,__NOT_SPOOFED,__PDS_NEWDOMAIN,__RB_GT_200,__RCD_RDNS_MAIL,__RCD_RDNS_MAIL_MESSY,__RCVD_IN_RPBL,__RCVD_IN_SORBS,__RCVD_IN_ZEN,__RP_MATCHES_RCVD,__SANE_MSGID,__SUBJ_NOT_SHORT,__SUBSCRIPTION_INFO,__TAG_EXISTS_BODY,__TAG_EXISTS_CENTER,__TAG_EXISTS_HEAD,__TAG_EXISTS_HTML,__TAG_EXISTS_META,__TOCC_EXISTS,__TVD_MIME_ATT_TP,__USING_VERP1

With my patch for the debug output, it looks like this (note the 320
deduplicated __E_LIKE_LETTER hits, for example):
Jun  7 09:55:43.872 [24500] dbg: check:
subtests=__ANY_TEXT_ATTACH,__ANY_TEXT_ATTACH_DOC,__BODY_TEXT_LINE(3),__BUGGED_IMG,__CLICK_HERE,__CT,__CTYPE_HAS_BOUNDARY,__CTYPE_MULTIPART_ALT,__CTYPE_MULTIPART_ANY,__DEAL,__DKIMWL_WL_BL,__DKIM_DEPENDABLE,__DKIM_EXISTS,__DOS_HAS_ANY_URI,__DOS_HAS_LIST_UNSUB,__DOS_RCVD_WED,__DOS_RELAYED_EXT,__DOS_SINGLE_EXT_RELAY,__END_FUTURE_EMAILS,__E_LIKE_LETTER(320),__FROM_FMBLA_NEWDOM,__FSL_HAS_LIST_UNSUB,__HAS_ANY_URI,__HAS_DATE,__HAS_DKIM_SIGHD,__HAS_DOMAINKEY_SIG,__HAS_FROM,__HAS_HREF,__HAS_MESSAGE_ID,__HAS_MSGID,__HAS_RCVD,__HAS_SUBJECT,__HAS_TO,__HAS_URI,__HTML_LINK_IMAGE,__HUSH_HUSH,__KAM_COUNT_URIS(8),__KAM_DROPBOX2,__KAM_FAKEDELIVER12,__KAM_FAKEDELIVER4,__KAM_FAKEDELIVER6,__KAM_FAKEDELIVER8,__KAM_GENERICHEALTH3,__KAM_GOOGLE2_2,__KAM_HARP3,__KAM_HAS_1_URIS,__KAM_HAS_2_URIS,__KAM_HAS_3_URIS,__KAM_HAS_4_URIS,__KAM_HAS_5_URIS,__KAM_HUGEIMGSRC,__KAM_JURY3,__KAM_LOTSOFHASH,__KAM_LOTTO3,__KAM_MAILSPLOIT2,__KAM_MULTIPLE_FROM,__KAM_PAYPAL3B,__KAM_SEO7,__KAM_TIME4,__KAM_UPS2,__KAM_WU1,__KB_WAM_FROM_NAME_SINGLEWORD,__LAST_EXTERNAL_RELAY_NO_AUTH,__LAST_UNTRUSTED_RELAY_NO_AUTH,__LIST_PARTIAL,__LOCAL_PP_NONPPURL,__LOWER_E(230),__MIME_HTML,__MIME_VERSION,__MISSING_REF,__MISSING_REPLY,__MSGID_OK_HOST,__NONEMPTY_BODY,__NOT_A_PERSON,__NOT_SPOOFED,__PDS_NEWDOMAIN,__RB_GT_200,__RCD_RDNS_MAIL,__RCD_RDNS_MAIL_MESSY,__RCVD_IN_SORBS,__RP_MATCHES_RCVD,__SANE_MSGID,__SUBJ_NOT_SHORT,__SUBSCRIPTION_INFO,__TAG_EXISTS_BODY,__TAG_EXISTS_CENTER,__TAG_EXISTS_HEAD,__TAG_EXISTS_HTML,__TAG_EXISTS_META,__TOCC_EXISTS,__TVD_MIME_ATT_TP,__USING_VERP1
(Total Subtest Hits: 649 / Deduplicated Total Hits: 92)

Thanks for the improvement on other duplicates you committed.  That will
help too.

My change for debug output is committed now for 3.4 and trunk.
Committed revision 1860766.
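
As a rough sketch of the counting approach (a compact restatement for
readers, not the committed code): count each subtest name in a hash, sort
the distinct names, and append "(N)" only when a name hit more than once.
The function name summarize_subtests and the sample hit list are
hypothetical.

```perl
use strict;
use warnings;

# Hypothetical stand-alone version of the dbg-mode summary.
sub summarize_subtests {
  my @hits = @_;

  # Count how many times each subtest name was hit.
  my %count;
  $count{$_}++ for @hits;

  my @names = sort keys %count;

  # Emit each name once, with a hit count only when it is greater than 1.
  my @parts;
  for my $name (@names) {
    push @parts, $count{$name} > 1 ? "$name($count{$name})" : $name;
  }

  return sprintf('%s (Total Subtest Hits: %d / Deduplicated Total Hits: %d)',
                 join(',', @parts), scalar(@hits), scalar(@names));
}

print summarize_subtests(qw(__CT __LOWER_E __LOWER_E __LOWER_E __HAS_URI)), "\n";
# prints: __CT,__HAS_URI,__LOWER_E(3) (Total Subtest Hits: 5 / Deduplicated Total Hits: 3)
```

Because the counts are only appended in this summary, callers that split
the plain comma-delimited list are unaffected.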

Regards,
KAM
On 6/7/2019 2:50 AM, Henrik K wrote:
> What does "unreadable for rule analysis" mean?  Surely no one is actually
> manually reading such lines one rule at a time?  Computers can check and
> grep for you.. ;-)
>
> I think this needs a little more thought about what we really want to
> accomplish here, and maybe it should be done in a bug along with the new
> templates and stuff if needed.
>
>
>
> On Thu, Jun 06, 2019 at 07:48:02AM -0400, Kevin A. McGrail wrote:
>> That is a frightening one liner.  Should we use it?
>>
>> As for the more output comment, if you have emails with 300 lower case e's, you
>> get 300 hits for the subtest.  It is unreadable for rule analysis.
>>
>> As for modifying the normal output, I have no idea if anyone out there is using
>> the public routine so better to be safe.
>>
>> I didn't find a tag for subtests either. That might be a good 4.0 addition.
>>
>> Regards, KAM
>>
>> On Thu, Jun 6, 2019, 01:30 Henrik K <[1...@hege.li> wrote:
>>
>>
>>     Well in theory you see _more_ debug output now when there are no
>>     duplicates, due to the stats string..  Honestly, at least I wouldn't
>>     care about that.
>>     Feel free to vote.
>>
>>     As a silly morning exercise, here's a one-liner that compacts stuff :-P
>>
>>     my $foo = '__A,__B,__C,__C,__C,__CC,__D,__D,__E,__E';
>>     my $m; $foo =~ s/([^,]+)(?{$m=1})(?:,\1(?=,|$)(?{$m++}))+/"$1($m)"/eg;
>>
>>     __A,__B,__C(3),__CC,__D(2),__E(2)
>>
>>
>>     On Wed, Jun 05, 2019 at 08:25:00PM -0400, Kevin A. McGrail wrote:
>>     > Good point, Henrik & John.
>>     >
>>     > OK, I've left the output alone except for the calls from dbg so it
>>     > shouldn't break anything in the public interface.
>>     >
>>     > Thoughts on this version?
>>     >
>>     > Regards,
>>     > KAM
>>     >
>>     > On 6/4/2019 1:51 PM, John Hardin wrote:
>>     > > On Tue, 4 Jun 2019, Kevin A. McGrail wrote:
>>     > >
>>     > >> Yes, I was thinking about that and wanting to fix uritests as well
>>     > >> for the template.   Thanks for the feedback.  I will take another
>>     > >> pass at it.
>>     > >
>>     > > Just do the deduplication without modifying the output format.
>>     > >
>>     > > If we want to log the hit counts, then make another function that does
>>     > > what you did and use it for logging.
>>     > >
>>     > >
>>     > >> On Tue, Jun 4, 2019, 03:23 Henrik K <[2...@hege.li> wrote:
>>     > >>
>>     > >>>
>>     > >>> If you want to modify debug output, you have to modify only the
>>     > >>> dbg() output itself.  You can't modify internal functions that
>>     > >>> have specific output formats and start adding random strings to
>>     > >>> them.  At least these places depend on the comma-delimited rules:
>>     > >>>
>>     > >>> ./masses/mass-check:    push @tests, split(/,/, $status->get_names_of_subtests_hit());
>>     > >>> ./t/rule_tests.t:    my %rules_hit = map { $_ => 1 } split(/,/,$msg->get_names_of_tests_hit()), split(/,/,$msg->get_names_of_subtests_hit());
>>     > >>> ./t.rules/run:  my $testsline = $status->get_names_of_tests_hit().",".$status->get_names_of_subtests_hit();
>>     > >>>
>>     > >>>
>>     > >>>
>>     > >>>
>>     > >>> On Tue, Jun 04, 2019 at 01:56:26AM -0400, Kevin A. McGrail wrote:
>>     > >>>> Morning All,
>>     > >>>>
>>     > >>>> After a few thoughts on limits, it appears that any duplicate
>>     > >>>> subtest hits are best combined for debug output.
>>     > >>>>
>>     > >>>> Any thoughts on the attached?  It looks like it will help me with
>>     > >>>> rule development while supporting rules with valid but large
>>     > >>>> maxhits like __LOWER_E.
>>     > >>>>
>>     > >>>> Regards,
>>     > >>>> KAM
>>     > >>>>
>>     > >>>> On 5/31/2019 10:30 AM, Bill Cole wrote:
>>     > >>>>> On 30 May 2019, at 20:35, Kevin A. McGrail wrote:
>>     > >>>>>
>>     > >>>>>> I was curious if anyone noticed the debug output for subtests has
>>     > >>>>>> gotten insane:
>>     > >>>>>
>>     > >>>>> It got a little discussion on users@ when I created those rules.
>>     > >>>>>
>>     > >>>>> [...]
>>     > >>>>>
>>     > >>>>>> [3]72_active.cf:    body            __LOWER_E       /e/
>>     > >>>>>> [4]72_active.cf:    tflags          __LOWER_E       multiple maxhits=230
>>     > >>>>>>
>>     > >>>>>> [5]72_active.cf:    body            __E_LIKE_LETTER /<lcase_e>/
>>     > >>>>>> [6]72_active.cf:    tflags          __E_LIKE_LETTER multiple maxhits=320
>>     > >>>>>>
>>     > >>>>>> Assuming those maxhits are correct,
>>     > >>>>>
>>     > >>>>> They are. In fact they were carefully tuned to catch the targeted
>>     > >>>>> extortion spam.
>>     > >>>>>
>>     > >>>>>> maybe we need something in the debug
>>     > >>>>>> output that says __E_LIKE_LETTER (number of hits if more than 1).
>>     > >>>>>
>>     > >>>>> That would be a useful enhancement even without my flagrant log
>>     > >>>>> vandalism.
>>     > >>>>>
>>     > >>>>
>>     > >>>> --
>>     > >>>> Kevin A. McGrail
>>     > >>>> Member, Apache Software Foundation
>>     > >>>> Chair Emeritus Apache SpamAssassin Project
>>     > >>>> [7]https://www.linkedin.com/in/kmcgrail - 703.798.0171
>>     > >>>>
>>     > >>>
>>     > >>>> Index: lib/Mail/SpamAssassin/PerMsgStatus.pm
>>     > >>>> ===================================================================
>>     > >>>> --- lib/Mail/SpamAssassin/PerMsgStatus.pm       (revision 1860582)
>>     > >>>> +++ lib/Mail/SpamAssassin/PerMsgStatus.pm       (working copy)
>>     > >>>> @@ -769,7 +769,38 @@
>>     > >>>>  sub get_names_of_subtests_hit {
>>     > >>>>    my ($self) = @_;
>>     > >>>>
>>     > >>>> -  return join(',', sort @{$self->{subtest_names_hit}});
>>     > >>>> +  #return join(',', sort @{$self->{subtest_names_hit}});
>>     > >>>> +
>>     > >>>> +  #This routine prints only one instance of a subrule hit with a count of how many times it hit if greater than 1
>>     > >>>> +  my (%subtest_names_hit, $i, $key, @keys, @sorted, $string, $rule, $total_hits, $deduplicated_hits);
>>     > >>>> +
>>     > >>>> +  $total_hits = scalar(@{$self->{subtest_names_hit}});
>>     > >>>> +
>>     > >>>> +  for ($i=0; $i < $total_hits; $i++) {
>>     > >>>> +    $rule = ${$self->{subtest_names_hit}}[$i];
>>     > >>>> +    $subtest_names_hit{$rule}++;
>>     > >>>> +  }
>>     > >>>> +
>>     > >>>> +  foreach $key (keys %subtest_names_hit) {
>>     > >>>> +    push (@keys, $key);
>>     > >>>> +  }
>>     > >>>> +  @sorted = sort @keys;
>>     > >>>> +
>>     > >>>> +  $deduplicated_hits = scalar(@sorted);
>>     > >>>> +
>>     > >>>> +  for ($i=0; $i < $deduplicated_hits; $i++) {
>>     > >>>> +    $string .= $sorted[$i];
>>     > >>>> +    if ($subtest_names_hit{$sorted[$i]} > 1) {
>>     > >>>> +      $string .= "($subtest_names_hit{$sorted[$i]})"
>>     > >>>> +    }
>>     > >>>> +    $string .= ",";
>>     > >>>> +  }
>>     > >>>> +
>>     > >>>> +  $string =~ s/,$//;
>>     > >>>> +
>>     > >>>> +  $string .= " (Total Subtest Hits: $total_hits / Deduplicated Total Hits: $deduplicated_hits)";
>>     > >>>> +
>>     > >>>> +  return $string;
>>     > >>>>  }
>>     > >>>>
>>     > >>>>
>>     > >>> #####################################################################
>>     ######
>>     > >>>
>>     > >>>
>>     > >>>
>>     > >>
>>     > >
>>     >
>>     > --
>>     > Kevin A. McGrail
>>     > Member, Apache Software Foundation
>>     > Chair Emeritus Apache SpamAssassin Project
>>     > [8]https://www.linkedin.com/in/kmcgrail - 703.798.0171
>>     >
>>
>>     > Index: lib/Mail/SpamAssassin/PerMsgStatus.pm
>>     > ===================================================================
>>     > --- lib/Mail/SpamAssassin/PerMsgStatus.pm       (revision 1860582)
>>     > +++ lib/Mail/SpamAssassin/PerMsgStatus.pm       (working copy)
>>     > @@ -398,7 +398,7 @@
>>     >    dbg("check: is spam? score=".$self->{score}.
>>     >                          " required=".$self->{conf}->{required_score});
>>     >    dbg("check: tests=".$self->get_names_of_tests_hit());
>>     > -  dbg("check: subtests=".$self->get_names_of_subtests_hit());
>>     > +  dbg("check: subtests=".$self->get_names_of_subtests_hit("dbg"));
>>     >    $self->{is_spam} = $self->is_spam();
>>     > 
>>     >    $self->{main}->{resolver}->bgabort();
>>     > @@ -764,12 +764,52 @@
>>     >  normally-hidden rules, which score 0 and have names beginning with two
>>     >  underscores, used in meta rules.
>>     > 
>>     > +If a parameter of dbg is passed, the output will be more condensed and
>>     > +sub-tests with multiple hits reduced to one entry with the number of hits
>>     > +in parentheses. Some information is also added at the end regarding the
>>     > +multiple hits.
>>     > +
>>     >  =cut
>>     > 
>>     >  sub get_names_of_subtests_hit {
>>     > -  my ($self) = @_;
>>     > +  my ($self, $mode) = @_;
>>     > 
>>     > -  return join(',', sort @{$self->{subtest_names_hit}});
>>     > +  if (defined $mode && $mode eq 'dbg') {
>>     > +    #This routine prints only one instance of a subrule hit with a count of how many times it hit if greater than 1
>>     > +    my (%subtest_names_hit, $i, $key, @keys, @sorted, $string, $rule, $total_hits, $deduplicated_hits);
>>     > + 
>>     > +    $total_hits = scalar(@{$self->{subtest_names_hit}});
>>     > + 
>>     > +    for ($i=0; $i < $total_hits; $i++) {
>>     > +      $rule = ${$self->{subtest_names_hit}}[$i];
>>     > +      $subtest_names_hit{$rule}++;
>>     > +    }
>>     > + 
>>     > +    foreach $key (keys %subtest_names_hit) {
>>     > +      push (@keys, $key);
>>     > +    }
>>     > +    @sorted = sort @keys;
>>     > + 
>>     > +    $deduplicated_hits = scalar(@sorted);
>>     > + 
>>     > +    for ($i=0; $i < $deduplicated_hits; $i++) {
>>     > +      $string .= $sorted[$i];
>>     > +      if ($subtest_names_hit{$sorted[$i]} > 1) {
>>     > +        $string .= "($subtest_names_hit{$sorted[$i]})"
>>     > +      }
>>     > +      $string .= ",";
>>     > +    }
>>     > + 
>>     > +    $string =~ s/,$//;
>>     > + 
>>     > +    $string .= " (Total Subtest Hits: $total_hits / Deduplicated Total Hits: $deduplicated_hits)";
>>     > + 
>>     > +    return $string;
>>     > +
>>     > +  } else {
>>     > +    #return the simpler string with duplicates and commas
>>     > +    return join(',', sort @{$self->{subtest_names_hit}});
>>     > +  }
>>     >  }
>>     > 
>>     >  ########################################################################
>>     ###
>>
>>
>>
>> On Thu, Jun 6, 2019, 01:30 Henrik K <[9...@hege.li> wrote:
>>
>>
>>     Well in theory you see _more_ debug output now when there are no
>>     duplicates,
>>     due to the stats string..  honestly atleast I wouldn't care about that.
>>     Feel free to vote.
>>
>>     As a silly morning exercise, here's a one-liner that compacts stuff :-P
>>
>>     my $foo = '__A,__B,__C,__C,__C,__CC,__D,__D,__E,__E';
>>     my $m; $foo =~ s/([^,]+)(?{$m=1})(?:,\1(?=,|$)(?{$m++}))+/"$1($m)"/eg;
>>
>>     __A,__B,__C(3),__CC,__D(2),__E(2)
>>
>>
>>     On Wed, Jun 05, 2019 at 08:25:00PM -0400, Kevin A. McGrail wrote:
>>     > Good point, Henrik & John.
>>     >
>>     > OK, I've left the output alone except for the calls from dbg so it
>>     > shouldn't break anything in the public interface.
>>     >
>>     > Thoughts on this version?
>>     >
>>     > Regards,
>>     > KAM
>>     >
>>     > On 6/4/2019 1:51 PM, John Hardin wrote:
>>     > > On Tue, 4 Jun 2019, Kevin A. McGrail wrote:
>>     > >
>>     > >> Yes, I was thinking about that and wanting to fix uritests so well
>>     > >> for the
>>     > >> template.   Thanks for the feedback.  I will take another pass at it.
>>     > >
>>     > > Just do the deduplication without modifying the output format.
>>     > >
>>     > > If we want to log the hit counts, then make another function that does
>>     > > what you did and use it for logging.
>>     > >
>>     > >
>>     > >> On Tue, Jun 4, 2019, 03:23 Henrik K <[1...@hege.li> wrote:
>>     > >>
>>     > >>>
>>     > >>> If you want to modify debug output, you have to modify only the dbg()
>>     > >>> output
>>     > >>> itself.  You can't modify internal functions that have specific
>>     output
>>     > >>> formats and start adding random strings to them.  Atleast these
>>     places
>>     > >>> depend on the comma delimited rules:
>>     > >>>
>>     > >>> ./masses/mass-check:    push @tests, split(/,/,
>>     > >>> $status->get_names_of_subtests_hit());
>>     > >>> ./t/rule_tests.t:    my %rules_hit = map { $_ => 1 }
>>     > >>> split(/,/,$msg->get_names_of_tests_hit()),
>>     > >>> split(/,/,$msg->get_names_of_subtests_hit());
>>     > >>> ./t.rules/run:  my $testsline =
>>     > >>> $status->get_names_of_tests_hit().",".$status->
>>     get_names_of_subtests_hit();
>>     > >>>
>>     > >>>
>>     > >>>
>>     > >>>
>>     > >>> On Tue, Jun 04, 2019 at 01:56:26AM -0400, Kevin A. McGrail wrote:
>>     > >>>> Morning All,
>>     > >>>>
>>     > >>>> After a few thoughts on limits, it appears that any duplicate
>>     subtest
>>     > >>>> hits are best combined for debug output.
>>     > >>>>
>>     > >>>> Any thoughts on the attached?  It looks like it will help me with
>>     rule
>>     > >>>> development while support rules with valid but large maxhits like
>>     > >>> __LOWER_E
>>     > >>>>
>>     > >>>> Regards,
>>     > >>>> KAM
>>     > >>>>
>>     > >>>> On 5/31/2019 10:30 AM, Bill Cole wrote:
>>     > >>>>> On 30 May 2019, at 20:35, Kevin A. McGrail wrote:
>>     > >>>>>
>>     > >>>>>> I was curious if anyone noticed the debug output for subtests has
>>     > >>> gotten
>>     > >>>>>> insane:
>>     > >>>>>
>>     > >>>>> It got a little discussion on users@ when I created those rules.
>>     > >>>>>
>>     > >>>>> [...]
>>     > >>>>>
>>     > >>>>>> [11]72_active.cf:    body            __LOWER_E       /e/
>>     > >>>>>> [12]72_active.cf:    tflags          __LOWER_E       multiple
>>     > >>>>>> maxhits=230
>>     > >>>>>>
>>     > >>>>>> [13]72_active.cf:    body            __E_LIKE_LETTER /<lcase_e>/
>>     > >>>>>> [14]72_active.cf:    tflags          __E_LIKE_LETTER multiple
>>     > >>>>>> maxhits=320
>>     > >>>>>>
>>     > >>>>>> Assuming those maxhits are correct,
>>     > >>>>>
>>     > >>>>> They are. In fact they were carefully tuned to catch the targeted
>>     > >>>>> extortion spam.
>>     > >>>>>
>>     > >>>>>> maybe we need something in the debug
>>     > >>>>>> output that says __E_LIKE_LETTER (number of hits if more than 1).
>>     > >>>>>
>>     > >>>>> That would be a useful enhancement even without my flagrant log
>>     > >>>>> vandalism.
>>     > >>>>>
>>     > >>>>
>>     > >>>> --
>>     > >>>> Kevin A. McGrail
>>     > >>>> Member, Apache Software Foundation
>>     > >>>> Chair Emeritus Apache SpamAssassin Project
>>     > >>>> [15]https://www.linkedin.com/in/kmcgrail - 703.798.0171
>>     > >>>>
>>     > >>>
>>     > >>>> Index: lib/Mail/SpamAssassin/PerMsgStatus.pm
>>     > >>>> ===================================================================
>>     > >>>> --- lib/Mail/SpamAssassin/PerMsgStatus.pm       (revision 1860582)
>>     > >>>> +++ lib/Mail/SpamAssassin/PerMsgStatus.pm       (working copy)
>>     > >>>> @@ -769,7 +769,38 @@
>>     > >>>>  sub get_names_of_subtests_hit {
>>     > >>>>    my ($self) = @_;
>>     > >>>>
>>     > >>>> -  return join(',', sort @{$self->{subtest_names_hit}});
>>     > >>>> +  #return join(',', sort @{$self->{subtest_names_hit}});
>>     > >>>> +
>>     > >>>> +  #This routine prints only one instance of a subrule hit with a
>>     > >>>> count
>>     > >>> of how many times it hit if greater than 1
>>     > >>>> +  my (%subtest_names_hit, $i, $key, @keys, @sorted, $string, $rule,
>>     > >>> $total_hits, $deduplicated_hits);
>>     > >>>> +
>>     > >>>> +  $total_hits = scalar(@{$self->{subtest_names_hit}});
>>     > >>>> +
>>     > >>>> +  for ($i=0; $i < $total_hits; $i++) {
>>     > >>>> +    $rule = ${$self->{subtest_names_hit}}[$i];
>>     > >>>> +    $subtest_names_hit{$rule}++;
>>     > >>>> +  }
>>     > >>>> +
>>     > >>>> +  foreach $key (keys %subtest_names_hit) {
>>     > >>>> +    push (@keys, $key);
>>     > >>>> +  }
>>     > >>>> +  @sorted = sort @keys;
>>     > >>>> +
>>     > >>>> +  $deduplicated_hits = scalar(@sorted);
>>     > >>>> +
>>     > >>>> +  for ($i=0; $i < $deduplicated_hits; $i++) {
>>     > >>>> +    $string .= $sorted[$i];
>>     > >>>> +    if ($subtest_names_hit{$sorted[$i]} > 1) {
>>     > >>>> +      $string .= "($subtest_names_hit{$sorted[$i]})"
>>     > >>>> +    }
>>     > >>>> +    $string .= ",";
>>     > >>>> +  }
>>     > >>>> +
>>     > >>>> +  $string =~ s/,$//;
>>     > >>>> +
>>     > >>>> +  $string .= " (Total Subtest Hits: $total_hits / Deduplicated
>>     Total
>>     > >>> Hits: $deduplicated_hits)";
>>     > >>>> +
>>     > >>>> +  return $string;
>>     > >>>>  }
>>     > >>>>
>>     > >>>>
>>     > >>> #####################################################################
>>     ######
>>     > >>>
>>     > >>>
>>     > >>>
>>     > >>
>>     > >
>>     >
>>     > --
>>     > Kevin A. McGrail
>>     > Member, Apache Software Foundation
>>     > Chair Emeritus Apache SpamAssassin Project
>>     > [16]https://www.linkedin.com/in/kmcgrail - 703.798.0171
>>     >
>>
>>     > Index: lib/Mail/SpamAssassin/PerMsgStatus.pm
>>     > ===================================================================
>>     > --- lib/Mail/SpamAssassin/PerMsgStatus.pm       (revision 1860582)
>>     > +++ lib/Mail/SpamAssassin/PerMsgStatus.pm       (working copy)
>>     > @@ -398,7 +398,7 @@
>>     >    dbg("check: is spam? score=".$self->{score}.
>>     >                          " required=".$self->{conf}->{required_score});
>>     >    dbg("check: tests=".$self->get_names_of_tests_hit());
>>     > -  dbg("check: subtests=".$self->get_names_of_subtests_hit());
>>     > +  dbg("check: subtests=".$self->get_names_of_subtests_hit("dbg"));
>>     >    $self->{is_spam} = $self->is_spam();
>>     > 
>>     >    $self->{main}->{resolver}->bgabort();
>>     > @@ -764,12 +764,52 @@
>>     >  normally-hidden rules, which score 0 and have names beginning with two
>>     >  underscores, used in meta rules.
>>     > 
>>     > +If a parameter of dbg is passed, the output will be more condensed and
>>     > +sub-tests with multiple hits reduced to one entry with the number of
>>     hits
>>     > +in parentheses. Some information is also added at the end regarding the
>>     > +multiple hits.
>>     > +
>>     >  =cut
>>     > 
>>     >  sub get_names_of_subtests_hit {
>>     > -  my ($self) = @_;
>>     > +  my ($self, $mode) = @_;
>>     > 
>>     > -  return join(',', sort @{$self->{subtest_names_hit}});
>>     > +  if (defined $mode && $mode eq 'dbg') {
>>     > +    #This routine prints only one instance of a subrule hit with a count
>>     of how many times it hit if greater than 1
>>     > +    my (%subtest_names_hit, $i, $key, @keys, @sorted, $string, $rule,
>>     $total_hits, $deduplicated_hits); 
>>     > + 
>>     > +    $total_hits = scalar(@{$self->{subtest_names_hit}});
>>     > + 
>>     > +    for ($i=0; $i < $total_hits; $i++) {
>>     > +      $rule = ${$self->{subtest_names_hit}}[$i];
>>     > +      $subtest_names_hit{$rule}++;
>>     > +    }
>>     > + 
>>     > +    foreach $key (keys %subtest_names_hit) {
>>     > +      push (@keys, $key);
>>     > +    }
>>     > +    @sorted = sort @keys;
>>     > + 
>>     > +    $deduplicated_hits = scalar(@sorted);
>>     > + 
>>     > +    for ($i=0; $i < $deduplicated_hits; $i++) {
>>     > +      $string .= $sorted[$i];
>>     > +      if ($subtest_names_hit{$sorted[$i]} > 1) {
>>     > +        $string .= "($subtest_names_hit{$sorted[$i]})"
>>     > +      }
>>     > +      $string .= ",";
>>     > +    }
>>     > + 
>>     > +    $string =~ s/,$//;
>>     > + 
>>     > +    $string .= " (Total Subtest Hits: $total_hits / Deduplicated Total Hits: $deduplicated_hits)";
>>     > + 
>>     > +    return $string;
>>     > +
>>     > +  } else {
>>     > +    #return the simpler string with duplicates and commas
>>     > +    return join(',', sort @{$self->{subtest_names_hit}});
>>     > +  }
>>     >  }
>>     > 
>>     >  ########################################################################
>>     ###
>>
>>
>>
>> References:
>>
>> [1] mailto:hege@hege.li
>> [2] mailto:hege@hege.li
>> [3] http://72_active.cf/
>> [4] http://72_active.cf/
>> [5] http://72_active.cf/
>> [6] http://72_active.cf/
>> [7] https://www.linkedin.com/in/kmcgrail
>> [8] https://www.linkedin.com/in/kmcgrail
>> [9] mailto:hege@hege.li
>> [10] mailto:hege@hege.li
>> [11] http://72_active.cf/
>> [12] http://72_active.cf/
>> [13] http://72_active.cf/
>> [14] http://72_active.cf/
>> [15] https://www.linkedin.com/in/kmcgrail
>> [16] https://www.linkedin.com/in/kmcgrail


-- 
Kevin A. McGrail
Member, Apache Software Foundation
Chair Emeritus Apache SpamAssassin Project
https://www.linkedin.com/in/kmcgrail - 703.798.0171



Re: The obviously different case of subtest debug flood

Posted by Bill Cole <sa...@billmail.scconsult.com>.
On 7 Jun 2019, at 14:53, John Hardin wrote:

> Now if the hits were duplicates, and we logged something like:
>
> Jun  7 11:25:44.265 [1569] dbg: rules: ran body rule __LOWER_E ======> 
> got hit: "e" (100)
>
> ...where we're not collapsing on solely the rule name, I'd accept 
> that.

FWIW, __LOWER_E specifically is this:

body            __LOWER_E       /e/

So there's no issue of hits varying at all. Its evil twin
__E_LIKE_LETTER will hit on anything that looks like an 'e', so it does
have diverse hits.
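
To make the distinction concrete, here is a minimal Perl sketch (not
SpamAssassin code) that counts the distinct strings a pattern matches.
The character class standing in for <lcase_e> is purely hypothetical,
since the real template is not shown in this thread; distinct_hits is
likewise an illustrative name.

```perl
use strict;
use warnings;

# Hypothetical helper: return a hashref mapping each distinct matched
# string to the number of times the pattern matched it in $text.
sub distinct_hits {
    my ($re, $text) = @_;
    my %seen;
    $seen{$1}++ while $text =~ /($re)/g;
    return \%seen;
}

# ASSUMPTION: a stand-in for <lcase_e>; "e" plus a few accented forms
# (U+00E8 to U+00EB). The real definition is not given in the thread.
my $e_like = qr/[e\x{e8}-\x{eb}]/;

# Sample text containing plain and accented e characters.
my $text = "Hello th\x{e9}re, s\x{e9}nd mon\x{eb}y here";

my $plain = distinct_hits(qr/e/, $text);
my $alike = distinct_hits($e_like, $text);

printf "plain /e/: %d distinct hit string(s)\n", scalar keys %$plain;
printf "e-like:    %d distinct hit string(s)\n", scalar keys %$alike;
```

With a literal /e/ every hit is the same string, so collapsing by rule
name loses nothing; with a look-alike class the matched strings differ.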

-- 
Bill Cole
bill@scconsult.com or billcole@apache.org
(AKA @grumpybozo and many *@billmail.scconsult.com addresses)
Not Currently Available For Hire

Re: The obviously different case of subtest debug flood

Posted by John Hardin <jh...@impsec.org>.
On Fri, 7 Jun 2019, Henrik K wrote:

> On Fri, Jun 07, 2019 at 07:48:56AM -0700, John Hardin wrote:
>> On Fri, 7 Jun 2019, Henrik K wrote:
>>
>>> Just committed a simple log suppressor for these kinds of spam..
>>>
>>> Jun  7 11:25:44.264 [1569] dbg: rules: ran body rule __LOWER_E ======> got hit: "e"
>>> Jun  7 11:25:44.264 [1569] dbg: rules: ran body rule __LOWER_E ======> got hit: "e"
>>> Jun  7 11:25:44.264 [1569] dbg: rules: ran body rule __LOWER_E ======> got hit: "e"
>>> Jun  7 11:25:44.264 [1569] dbg: rules: ran body rule __LOWER_E ======> got hit: "e"
>>> Jun  7 11:25:44.265 [1569] dbg: rules: ran body rule __LOWER_E ======> got hit: "e"
>>> Jun  7 11:25:44.265 [1569] dbg: rules: ran body rule __LOWER_E ======> got hit: "e"
>>> Jun  7 11:25:44.265 [1569] dbg: rules: ran body rule __LOWER_E ======> got hit: "e"
>>> Jun  7 11:25:44.265 [1569] dbg: rules: ran body rule __LOWER_E ======> got hit: "e"
>>> Jun  7 11:25:44.265 [1569] dbg: rules: ran body rule __LOWER_E ======> got hit: "e"
>>> Jun  7 11:25:44.265 [1569] dbg: rules: ran body rule __LOWER_E ======> got hit: "e"
>>> Jun  7 11:25:44.269 [1569] dbg: --- last message repeated 21 times ---
>>
>> Veto doing that. That information is very useful when debugging rules.
>
> Well the information is there.  In many places.  You are saying you are
> consistently using things like spamassassin -t -D | grep __LOWER_E | wc -l
> to debug your rules?

If I'm working on a complex "multiple" rule (like the text variations in
the bitcoin extortion and fraud rules), then I want to see all the hits
and, more importantly, what each one hit on.

Now if the hits were duplicates, and we logged something like:

Jun  7 11:25:44.265 [1569] dbg: rules: ran body rule __LOWER_E ======> got hit: "e" (100)

...where we're not collapsing on solely the rule name, I'd accept that.
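
A rough Perl sketch of that idea, assuming the hits are available as
(rule name, matched text) pairs. The name collapse_hits and the input
shape are hypothetical; this is not the SpamAssassin logging code.

```perl
use strict;
use warnings;

binmode(STDOUT, ':encoding(UTF-8)');

# Hypothetical helper: collapse consecutive hits only when BOTH the rule
# name and the matched text are identical, appending the repeat count.
# Input is a list of [rule_name, matched_text] pairs in hit order.
sub collapse_hits {
    my @hits = @_;
    my @runs;
    for my $hit (@hits) {
        my ($rule, $text) = @$hit;
        if (@runs && $runs[-1]{rule} eq $rule && $runs[-1]{text} eq $text) {
            $runs[-1]{count}++;
        } else {
            push @runs, { rule => $rule, text => $text, count => 1 };
        }
    }
    my @lines;
    for my $run (@runs) {
        my $line = sprintf('ran body rule %s ======> got hit: "%s"',
                           $run->{rule}, $run->{text});
        $line .= " ($run->{count})" if $run->{count} > 1;
        push @lines, $line;
    }
    return @lines;
}

print "$_\n" for collapse_hits(
    (['__LOWER_E', 'e']) x 3,
    ['__E_LIKE_LETTER', 'e'],
    ['__E_LIKE_LETTER', "\x{e9}"],   # e-acute: a different hit text
);
```

Identical hits fold into one line with a count, while hits on different
text stay on separate lines.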


-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
   The ["assault weapons"] ban is the moral equivalent of banning red
   cars because they look too fast.  -- Steve Chapman, Chicago Tribune
-----------------------------------------------------------------------
  3 days until the 52nd anniversary of Israel's victory in the Six-Day War

Re: The obviously different case of subtest debug flood

Posted by "Kevin A. McGrail" <km...@apache.org>.
On 6/7/2019 11:33 AM, Henrik K wrote:
> Well the information is there.  In many places.  You are saying you are
> consistently using things like spamassassin -t -D | grep __LOWER_E | wc -l
> to debug your rules?

Close.  I am consistently using something like

spamassassin -t -D 2>&1 | grep -i -e Content\ analysis -e KAM

and since I write mostly meta rules, I get some really long and
hard-to-read subtest debug lines.
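
For reference, the compaction being discussed for the "subtests=" line
can be sketched in a few lines of Perl. This is an illustrative
standalone function under assumed names (compact_subtests), not the
patch itself and not the PerMsgStatus API.

```perl
use strict;
use warnings;

# Hypothetical helper: given the raw list of subtest names (duplicates
# included), return one sorted, comma-separated entry per name, with
# "(N)" appended when a name was hit more than once.
sub compact_subtests {
    my @names = @_;
    my %count;
    $count{$_}++ for @names;

    my @parts;
    for my $name (sort keys %count) {
        push @parts, $count{$name} > 1 ? "$name($count{$name})" : $name;
    }
    return join ',', @parts;
}

print compact_subtests(qw(__A __B __C __C __C __CC __D __D __E __E)), "\n";
# prints: __A,__B,__C(3),__CC,__D(2),__E(2)
```

Because the result is no longer a plain comma-separated list of rule
names, it only suits the debug line, not callers that split on commas.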

-- 
Kevin A. McGrail
Member, Apache Software Foundation
Chair Emeritus Apache SpamAssassin Project
https://www.linkedin.com/in/kmcgrail - 703.798.0171


The obviously different case of subtest debug flood

Posted by Henrik K <he...@hege.li>.
On Fri, Jun 07, 2019 at 07:48:56AM -0700, John Hardin wrote:
> On Fri, 7 Jun 2019, Henrik K wrote:
> 
> >Just committed a simple log suppressor for these kinds of spam..
> >
> >Jun  7 11:25:44.264 [1569] dbg: rules: ran body rule __LOWER_E ======> got hit: "e"
> >Jun  7 11:25:44.264 [1569] dbg: rules: ran body rule __LOWER_E ======> got hit: "e"
> >Jun  7 11:25:44.264 [1569] dbg: rules: ran body rule __LOWER_E ======> got hit: "e"
> >Jun  7 11:25:44.264 [1569] dbg: rules: ran body rule __LOWER_E ======> got hit: "e"
> >Jun  7 11:25:44.265 [1569] dbg: rules: ran body rule __LOWER_E ======> got hit: "e"
> >Jun  7 11:25:44.265 [1569] dbg: rules: ran body rule __LOWER_E ======> got hit: "e"
> >Jun  7 11:25:44.265 [1569] dbg: rules: ran body rule __LOWER_E ======> got hit: "e"
> >Jun  7 11:25:44.265 [1569] dbg: rules: ran body rule __LOWER_E ======> got hit: "e"
> >Jun  7 11:25:44.265 [1569] dbg: rules: ran body rule __LOWER_E ======> got hit: "e"
> >Jun  7 11:25:44.265 [1569] dbg: rules: ran body rule __LOWER_E ======> got hit: "e"
> >Jun  7 11:25:44.269 [1569] dbg: --- last message repeated 21 times ---
> 
> Veto doing that. That information is very useful when debugging rules.

Well the information is there.  In many places.  You are saying you are
consistently using things like spamassassin -t -D | grep __LOWER_E | wc -l
to debug your rules?
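
As a small illustration of the kind of counting that pipeline does,
here is a Perl sketch that tallies "got hit" debug lines per rule. The
regex is an assumption based only on the log lines quoted in this
thread, and count_rule_hits is a hypothetical name.

```perl
use strict;
use warnings;

# Hypothetical helper: count "got hit" debug lines per rule name.
# ASSUMPTION: lines look like the samples quoted in this thread.
sub count_rule_hits {
    my @lines = @_;
    my %hits;
    for my $line (@lines) {
        $hits{$1}++ if $line =~ /dbg: rules: ran \w+ rule (\S+) =+> got hit/;
    }
    return \%hits;
}

my @sample = (
    'Jun  7 11:25:44.264 [1569] dbg: rules: ran body rule __LOWER_E ======> got hit: "e"',
    'Jun  7 11:25:44.264 [1569] dbg: rules: ran body rule __LOWER_E ======> got hit: "e"',
    'Jun  7 11:25:44.269 [1569] dbg: --- last message repeated 21 times ---',
);

my $hits = count_rule_hits(@sample);
printf "%s: %d\n", $_, $hits->{$_} for sort keys %$hits;
```

Note that a "last message repeated" line carries no rule name, so a
counter like this one would not see the hits it summarizes.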


Re: __E_LIKE_LETTER & __LOWER_E filling subtests debug

Posted by John Hardin <jh...@impsec.org>.
On Fri, 7 Jun 2019, Henrik K wrote:

> Just committed a simple log suppressor for these kinds of spam..
>
> Jun  7 11:25:44.264 [1569] dbg: rules: ran body rule __LOWER_E ======> got hit: "e"
> Jun  7 11:25:44.264 [1569] dbg: rules: ran body rule __LOWER_E ======> got hit: "e"
> Jun  7 11:25:44.264 [1569] dbg: rules: ran body rule __LOWER_E ======> got hit: "e"
> Jun  7 11:25:44.264 [1569] dbg: rules: ran body rule __LOWER_E ======> got hit: "e"
> Jun  7 11:25:44.265 [1569] dbg: rules: ran body rule __LOWER_E ======> got hit: "e"
> Jun  7 11:25:44.265 [1569] dbg: rules: ran body rule __LOWER_E ======> got hit: "e"
> Jun  7 11:25:44.265 [1569] dbg: rules: ran body rule __LOWER_E ======> got hit: "e"
> Jun  7 11:25:44.265 [1569] dbg: rules: ran body rule __LOWER_E ======> got hit: "e"
> Jun  7 11:25:44.265 [1569] dbg: rules: ran body rule __LOWER_E ======> got hit: "e"
> Jun  7 11:25:44.265 [1569] dbg: rules: ran body rule __LOWER_E ======> got hit: "e"
> Jun  7 11:25:44.269 [1569] dbg: --- last message repeated 21 times ---

Veto doing that. That information is very useful when debugging rules.

Also, that's not the context here. The change under discussion fixes a log
line with "__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E,__LOWER_E"
ad nauseam.

{snipping history}

-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
   News flash: Lowest Common Denominator down 50 points
-----------------------------------------------------------------------
  3 days until the 52nd anniversary of Israel's victory in the Six-Day War

Re: __E_LIKE_LETTER & __LOWER_E filling subtests debug

Posted by Henrik K <he...@hege.li>.
Just committed a simple log suppressor for these kinds of spam..

Jun  7 11:25:44.264 [1569] dbg: rules: ran body rule __LOWER_E ======> got hit: "e"
Jun  7 11:25:44.264 [1569] dbg: rules: ran body rule __LOWER_E ======> got hit: "e"
Jun  7 11:25:44.264 [1569] dbg: rules: ran body rule __LOWER_E ======> got hit: "e"
Jun  7 11:25:44.264 [1569] dbg: rules: ran body rule __LOWER_E ======> got hit: "e"
Jun  7 11:25:44.265 [1569] dbg: rules: ran body rule __LOWER_E ======> got hit: "e"
Jun  7 11:25:44.265 [1569] dbg: rules: ran body rule __LOWER_E ======> got hit: "e"
Jun  7 11:25:44.265 [1569] dbg: rules: ran body rule __LOWER_E ======> got hit: "e"
Jun  7 11:25:44.265 [1569] dbg: rules: ran body rule __LOWER_E ======> got hit: "e"
Jun  7 11:25:44.265 [1569] dbg: rules: ran body rule __LOWER_E ======> got hit: "e"
Jun  7 11:25:44.265 [1569] dbg: rules: ran body rule __LOWER_E ======> got hit: "e"
Jun  7 11:25:44.269 [1569] dbg: --- last message repeated 21 times ---
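
A suppressor with this behaviour could look roughly like the Perl
sketch below. It is not the committed code: the function name, the
limit of 10, and the assumption that messages are compared without
their timestamps are all guesses made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: pass through up to $limit consecutive identical
# messages, then replace the remainder of the run with one summary line.
# ASSUMPTION: messages are compared with timestamps already stripped.
sub suppress_repeats {
    my ($limit, @msgs) = @_;
    my (@out, $last, $run);

    # Emit the summary for the run that just ended, if it overflowed.
    my $flush = sub {
        push @out, "--- last message repeated " . ($run - $limit) . " times ---"
            if defined $last && $run > $limit;
    };

    for my $msg (@msgs) {
        if (defined $last && $msg eq $last) {
            $run++;
        } else {
            $flush->();
            ($last, $run) = ($msg, 1);
        }
        push @out, $msg if $run <= $limit;
    }
    $flush->();
    return @out;
}

my @out = suppress_repeats(10, ('ran body rule __LOWER_E ======> got hit: "e"') x 31);
print "$_\n" for @out;
```

With 31 identical messages and a limit of 10, this prints ten hit lines
followed by "--- last message repeated 21 times ---".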




On Fri, Jun 07, 2019 at 09:50:51AM +0300, Henrik K wrote:
> 
> What does "unreadable for rule analysis" mean?  Surely no one is actually
> manually reading such lines one rule at a time?  Computers can check and
> grep for you.. ;-)
> 
> I think this needs a little bit more of thought what we really want to
> accomplish here and maybe do it in a bug along with the new templates and
> stuff if needed..
> 
> 
> 
> On Thu, Jun 06, 2019 at 07:48:02AM -0400, Kevin A. McGrail wrote:
> > That is a frightening one liner.  Should we use it?
> > 
> > As for the more output comment, if you have emails with 300 lower case e's, you
> > get 300 hits for the subtest.  It is unreadable for rule analysis.
> > 
> > As for modifying the normal output, I have no idea if anyone out there is using
> > the public routine so better to be safe.
> > 
> > I didn't find a tag for subtests either. That might be a good 4.0 addition.
> > 
> > Regards, KAM
> > 
> > On Thu, Jun 6, 2019, 01:30 Henrik K <[1...@hege.li> wrote:
> > 
> > 
> >     Well in theory you see _more_ debug output now when there are no
> >     duplicates, due to the stats string..  honestly at least I wouldn't
> >     care about that.  Feel free to vote.
> > 
> >     As a silly morning exercise, here's a one-liner that compacts stuff :-P
> > 
> >     my $foo = '__A,__B,__C,__C,__C,__CC,__D,__D,__E,__E';
> >     my $m; $foo =~ s/([^,]+)(?{$m=1})(?:,\1(?=,|$)(?{$m++}))+/"$1($m)"/eg;
> > 
> >     __A,__B,__C(3),__CC,__D(2),__E(2)
> > 
> > 
> >     On Wed, Jun 05, 2019 at 08:25:00PM -0400, Kevin A. McGrail wrote:
> >     > Good point, Henrik & John.
> >     >
> >     > OK, I've left the output alone except for the calls from dbg so it
> >     > shouldn't break anything in the public interface.
> >     >
> >     > Thoughts on this version?
> >     >
> >     > Regards,
> >     > KAM
> >     >
> >     > On 6/4/2019 1:51 PM, John Hardin wrote:
> >     > > On Tue, 4 Jun 2019, Kevin A. McGrail wrote:
> >     > >
> >     > >> Yes, I was thinking about that and wanting to fix uritests as well
> >     > >> for the template.   Thanks for the feedback.  I will take another
> >     > >> pass at it.
> >     > >
> >     > > Just do the deduplication without modifying the output format.
> >     > >
> >     > > If we want to log the hit counts, then make another function that does
> >     > > what you did and use it for logging.
> >     > >
> >     > >
> >     > >> On Tue, Jun 4, 2019, 03:23 Henrik K <[2...@hege.li> wrote:
> >     > >>
> >     > >>>
> >     > >>> If you want to modify debug output, you have to modify only the dbg()
> >     > >>> output itself.  You can't modify internal functions that have specific
> >     > >>> output formats and start adding random strings to them.  At least these
> >     > >>> places depend on the comma delimited rules:
> >     > >>>
> >     > >>> ./masses/mass-check:    push @tests, split(/,/, $status->get_names_of_subtests_hit());
> >     > >>> ./t/rule_tests.t:    my %rules_hit = map { $_ => 1 } split(/,/,$msg->get_names_of_tests_hit()), split(/,/,$msg->get_names_of_subtests_hit());
> >     > >>> ./t.rules/run:  my $testsline = $status->get_names_of_tests_hit().",".$status->get_names_of_subtests_hit();
> >     > >>>
> >     > >>>
> >     > >>>
> >     > >>>
> >     > >>> On Tue, Jun 04, 2019 at 01:56:26AM -0400, Kevin A. McGrail wrote:
> >     > >>>> Morning All,
> >     > >>>>
> >     > >>>> After a few thoughts on limits, it appears that any duplicate subtest
> >     > >>>> hits are best combined for debug output.
> >     > >>>>
> >     > >>>> Any thoughts on the attached?  It looks like it will help me with rule
> >     > >>>> development while supporting rules with valid but large maxhits like
> >     > >>>> __LOWER_E.
> >     > >>>>
> >     > >>>> Regards,
> >     > >>>> KAM
> >     > >>>>
> >     > >>>> On 5/31/2019 10:30 AM, Bill Cole wrote:
> >     > >>>>> On 30 May 2019, at 20:35, Kevin A. McGrail wrote:
> >     > >>>>>
> >     > >>>>>> I was curious if anyone noticed the debug output for subtests has
> >     > >>>>>> gotten insane:
> >     > >>>>>
> >     > >>>>> It got a little discussion on users@ when I created those rules.
> >     > >>>>>
> >     > >>>>> [...]
> >     > >>>>>
> >     > >>>>>> [3]72_active.cf:    body            __LOWER_E       /e/
> >     > >>>>>> [4]72_active.cf:    tflags          __LOWER_E       multiple maxhits=230
> >     > >>>>>>
> >     > >>>>>> [5]72_active.cf:    body            __E_LIKE_LETTER /<lcase_e>/
> >     > >>>>>> [6]72_active.cf:    tflags          __E_LIKE_LETTER multiple maxhits=320
> >     > >>>>>>
> >     > >>>>>> Assuming those maxhits are correct,
> >     > >>>>>
> >     > >>>>> They are. In fact they were carefully tuned to catch the targeted
> >     > >>>>> extortion spam.
> >     > >>>>>
> >     > >>>>>> maybe we need something in the debug
> >     > >>>>>> output that says __E_LIKE_LETTER (number of hits if more than 1).
> >     > >>>>>
> >     > >>>>> That would be a useful enhancement even without my flagrant log
> >     > >>>>> vandalism.
> >     > >>>>>
> >     > >>>>
> >     > >>>> --
> >     > >>>> Kevin A. McGrail
> >     > >>>> Member, Apache Software Foundation
> >     > >>>> Chair Emeritus Apache SpamAssassin Project
> >     > >>>> [7]https://www.linkedin.com/in/kmcgrail - 703.798.0171
> >     > >>>>
> >     > >>>
> >     > >>>> Index: lib/Mail/SpamAssassin/PerMsgStatus.pm
> >     > >>>> ===================================================================
> >     > >>>> --- lib/Mail/SpamAssassin/PerMsgStatus.pm       (revision 1860582)
> >     > >>>> +++ lib/Mail/SpamAssassin/PerMsgStatus.pm       (working copy)
> >     > >>>> @@ -769,7 +769,38 @@
> >     > >>>>  sub get_names_of_subtests_hit {
> >     > >>>>    my ($self) = @_;
> >     > >>>>
> >     > >>>> -  return join(',', sort @{$self->{subtest_names_hit}});
> >     > >>>> +  #return join(',', sort @{$self->{subtest_names_hit}});
> >     > >>>> +
> >     > >>>> +  #This routine prints only one instance of a subrule hit with a count of how many times it hit if greater than 1
> >     > >>>> +  my (%subtest_names_hit, $i, $key, @keys, @sorted, $string, $rule, $total_hits, $deduplicated_hits);
> >     > >>>> +
> >     > >>>> +  $total_hits = scalar(@{$self->{subtest_names_hit}});
> >     > >>>> +
> >     > >>>> +  for ($i=0; $i < $total_hits; $i++) {
> >     > >>>> +    $rule = ${$self->{subtest_names_hit}}[$i];
> >     > >>>> +    $subtest_names_hit{$rule}++;
> >     > >>>> +  }
> >     > >>>> +
> >     > >>>> +  foreach $key (keys %subtest_names_hit) {
> >     > >>>> +    push (@keys, $key);
> >     > >>>> +  }
> >     > >>>> +  @sorted = sort @keys;
> >     > >>>> +
> >     > >>>> +  $deduplicated_hits = scalar(@sorted);
> >     > >>>> +
> >     > >>>> +  for ($i=0; $i < $deduplicated_hits; $i++) {
> >     > >>>> +    $string .= $sorted[$i];
> >     > >>>> +    if ($subtest_names_hit{$sorted[$i]} > 1) {
> >     > >>>> +      $string .= "($subtest_names_hit{$sorted[$i]})"
> >     > >>>> +    }
> >     > >>>> +    $string .= ",";
> >     > >>>> +  }
> >     > >>>> +
> >     > >>>> +  $string =~ s/,$//;
> >     > >>>> +
> >     > >>>> +  $string .= " (Total Subtest Hits: $total_hits / Deduplicated Total Hits: $deduplicated_hits)";
> >     > >>>> +
> >     > >>>> +  return $string;
> >     > >>>>  }
> >     > >>>>
> >     > >>>>
> >     > >>> #####################################################################
> >     ######
> >     > >>>
> >     > >>>
> >     > >>>
> >     > >>
> >     > >
> >     >
> >     > --
> >     > Kevin A. McGrail
> >     > Member, Apache Software Foundation
> >     > Chair Emeritus Apache SpamAssassin Project
> >     > [8]https://www.linkedin.com/in/kmcgrail - 703.798.0171
> >     >
> > 
> >     > Index: lib/Mail/SpamAssassin/PerMsgStatus.pm
> >     > ===================================================================
> >     > --- lib/Mail/SpamAssassin/PerMsgStatus.pm       (revision 1860582)
> >     > +++ lib/Mail/SpamAssassin/PerMsgStatus.pm       (working copy)
> >     > @@ -398,7 +398,7 @@
> >     >    dbg("check: is spam? score=".$self->{score}.
> >     >                          " required=".$self->{conf}->{required_score});
> >     >    dbg("check: tests=".$self->get_names_of_tests_hit());
> >     > -  dbg("check: subtests=".$self->get_names_of_subtests_hit());
> >     > +  dbg("check: subtests=".$self->get_names_of_subtests_hit("dbg"));
> >     >    $self->{is_spam} = $self->is_spam();
> >     > 
> >     >    $self->{main}->{resolver}->bgabort();
> >     > @@ -764,12 +764,52 @@
> >     >  normally-hidden rules, which score 0 and have names beginning with two
> >     >  underscores, used in meta rules.
> >     > 
> >     > +If a parameter of dbg is passed, the output will be more condensed and
> >     > +sub-tests with multiple hits reduced to one entry with the number of hits
> >     > +in parentheses. Some information is also added at the end regarding the
> >     > +multiple hits.
> >     > +
> >     >  =cut
> >     > 
> >     >  sub get_names_of_subtests_hit {
> >     > -  my ($self) = @_;
> >     > +  my ($self, $mode) = @_;
> >     > 
> >     > -  return join(',', sort @{$self->{subtest_names_hit}});
> >     > +  if (defined $mode && $mode eq 'dbg') {
> >     > +    #This routine prints only one instance of a subrule hit with a count of how many times it hit if greater than 1
> >     > +    my (%subtest_names_hit, $i, $key, @keys, @sorted, $string, $rule, $total_hits, $deduplicated_hits);
> >     > + 
> >     > +    $total_hits = scalar(@{$self->{subtest_names_hit}});
> >     > + 
> >     > +    for ($i=0; $i < $total_hits; $i++) {
> >     > +      $rule = ${$self->{subtest_names_hit}}[$i];
> >     > +      $subtest_names_hit{$rule}++;
> >     > +    }
> >     > + 
> >     > +    foreach $key (keys %subtest_names_hit) {
> >     > +      push (@keys, $key);
> >     > +    }
> >     > +    @sorted = sort @keys;
> >     > + 
> >     > +    $deduplicated_hits = scalar(@sorted);
> >     > + 
> >     > +    for ($i=0; $i < $deduplicated_hits; $i++) {
> >     > +      $string .= $sorted[$i];
> >     > +      if ($subtest_names_hit{$sorted[$i]} > 1) {
> >     > +        $string .= "($subtest_names_hit{$sorted[$i]})"
> >     > +      }
> >     > +      $string .= ",";
> >     > +    }
> >     > + 
> >     > +    $string =~ s/,$//;
> >     > + 
> >     > +    $string .= " (Total Subtest Hits: $total_hits / Deduplicated Total Hits: $deduplicated_hits)";
> >     > + 
> >     > +    return $string;
> >     > +
> >     > +  } else {
> >     > +    #return the simpler string with duplicates and commas
> >     > +    return join(',', sort @{$self->{subtest_names_hit}});
> >     > +  }
> >     >  }
> >     > 
> >     >  ########################################################################
> >     ###
> > 
> > 
> > 
> > References:
> > 
> > [1] mailto:hege@hege.li
> > [2] mailto:hege@hege.li
> > [3] http://72_active.cf/
> > [4] http://72_active.cf/
> > [5] http://72_active.cf/
> > [6] http://72_active.cf/
> > [7] https://www.linkedin.com/in/kmcgrail
> > [8] https://www.linkedin.com/in/kmcgrail

Re: __E_LIKE_LETTER & __LOWER_E filling subtests debug

Posted by Henrik K <he...@hege.li>.
What does "unreadable for rule analysis" mean?  Surely no one is actually
manually reading such lines one rule at a time?  Computers can check and
grep for you.. ;-)

I think this needs a bit more thought about what we really want to
accomplish here, and maybe it should be done in a bug along with the new
templates and stuff if needed..
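
As a minimal sketch of the "computers can check and grep for you" point:
assuming the debug line keeps the "check: subtests=" form quoted at the top
of this thread, a few lines of Perl can tally hits per subrule from a saved
log line. The sub name tally_subtests and the sample line are made up for
illustration; neither is part of SpamAssassin.

```perl
use strict;
use warnings;

# Hypothetical helper: count how often each subrule name appears in a
# "check: subtests=..." debug line. Returns a hash reference name => count.
sub tally_subtests {
  my ($line) = @_;
  my %count;
  # Take everything after "subtests=" up to the first whitespace.
  if ($line =~ /\bsubtests=(\S*)/) {
    $count{$_}++ for grep { length } split /,/, $1;
  }
  return \%count;
}

# Made-up sample line in the same shape as the debug output.
my $line = 'dbg: check: subtests=__CT,__E_LIKE_LETTER,__E_LIKE_LETTER,__LOWER_E';
my $count = tally_subtests($line);
printf "%-20s %d\n", $_, $count->{$_} for sort keys %$count;
```

This works on the existing log format, so it needs no change to
get_names_of_subtests_hit() or to the callers that split on commas.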



On Thu, Jun 06, 2019 at 07:48:02AM -0400, Kevin A. McGrail wrote:
> That is a frightening one liner.  Should we use it?
> 
> As for the more output comment, if you have emails with 300 lower case e's, you
> get 300 hits for the subtest.  It is unreadable for rule analysis.
> 
> As for modifying the normal output, I have no idea if anyone out there is using
> the public routine so better to be safe.
> 
> I didn't find a tag for subtests either. That might be a good 4.0 addition.
> 
> Regards, KAM
> 
> On Thu, Jun 6, 2019, 01:30 Henrik K <[1...@hege.li> wrote:
> 
> 
>     Well in theory you see _more_ debug output now when there are no
>     duplicates, due to the stats string..  honestly at least I wouldn't
>     care about that.  Feel free to vote.
> 
>     As a silly morning exercise, here's a one-liner that compacts stuff :-P
> 
>     my $foo = '__A,__B,__C,__C,__C,__CC,__D,__D,__E,__E';
>     my $m; $foo =~ s/([^,]+)(?{$m=1})(?:,\1(?=,|$)(?{$m++}))+/"$1($m)"/eg;
> 
>     __A,__B,__C(3),__CC,__D(2),__E(2)
> 
> 
>     On Wed, Jun 05, 2019 at 08:25:00PM -0400, Kevin A. McGrail wrote:
>     > Good point, Henrik & John.
>     >
>     > OK, I've left the output alone except for the calls from dbg so it
>     > shouldn't break anything in the public interface.
>     >
>     > Thoughts on this version?
>     >
>     > Regards,
>     > KAM
>     >
>     > On 6/4/2019 1:51 PM, John Hardin wrote:
>     > > On Tue, 4 Jun 2019, Kevin A. McGrail wrote:
>     > >
>     > >> Yes, I was thinking about that and wanting to fix uritests so well
>     > >> for the
>     > >> template.   Thanks for the feedback.  I will take another pass at it.
>     > >
>     > > Just do the deduplication without modifying the output format.
>     > >
>     > > If we want to log the hit counts, then make another function that does
>     > > what you did and use it for logging.
>     > >
>     > >
>     > >> On Tue, Jun 4, 2019, 03:23 Henrik K <[2...@hege.li> wrote:
>     > >>
>     > >>>
>     > >>> If you want to modify debug output, you have to modify only the
>     > >>> dbg() output itself.  You can't modify internal functions that have
>     > >>> specific output formats and start adding random strings to them.
>     > >>> At least these places depend on the comma delimited rules:
>     > >>>
>     > >>> ./masses/mass-check:    push @tests, split(/,/,
>     > >>> $status->get_names_of_subtests_hit());
>     > >>> ./t/rule_tests.t:    my %rules_hit = map { $_ => 1 }
>     > >>> split(/,/,$msg->get_names_of_tests_hit()),
>     > >>> split(/,/,$msg->get_names_of_subtests_hit());
>     > >>> ./t.rules/run:  my $testsline =
>     > >>> $status->get_names_of_tests_hit().",".$status->get_names_of_subtests_hit();
>     > >>>
>     > >>>
>     > >>>
>     > >>>
>     > >>> On Tue, Jun 04, 2019 at 01:56:26AM -0400, Kevin A. McGrail wrote:
>     > >>>> Morning All,
>     > >>>>
>     > >>>> After a few thoughts on limits, it appears that any duplicate
>     > >>>> subtest hits are best combined for debug output.
>     > >>>>
>     > >>>> Any thoughts on the attached?  It looks like it will help me with
>     > >>>> rule development while supporting rules with valid but large
>     > >>>> maxhits like __LOWER_E
>     > >>>>
>     > >>>> Regards,
>     > >>>> KAM
>     > >>>>
>     > >>>> On 5/31/2019 10:30 AM, Bill Cole wrote:
>     > >>>>> On 30 May 2019, at 20:35, Kevin A. McGrail wrote:
>     > >>>>>
>     > >>>>>> I was curious if anyone noticed the debug output for subtests has
>     > >>> gotten
>     > >>>>>> insane:
>     > >>>>>
>     > >>>>> It got a little discussion on users@ when I created those rules.
>     > >>>>>
>     > >>>>> [...]
>     > >>>>>
>     > >>>>>> [3]72_active.cf:    body            __LOWER_E       /e/
>     > >>>>>> [4]72_active.cf:    tflags          __LOWER_E       multiple
>     > >>>>>> maxhits=230
>     > >>>>>>
>     > >>>>>> [5]72_active.cf:    body            __E_LIKE_LETTER /<lcase_e>/
>     > >>>>>> [6]72_active.cf:    tflags          __E_LIKE_LETTER multiple
>     > >>>>>> maxhits=320
>     > >>>>>>
>     > >>>>>> Assuming those maxhits are correct,
>     > >>>>>
>     > >>>>> They are. In fact they were carefully tuned to catch the targeted
>     > >>>>> extortion spam.
>     > >>>>>
>     > >>>>>> maybe we need something in the debug
>     > >>>>>> output that says __E_LIKE_LETTER (number of hits if more than 1).
>     > >>>>>
>     > >>>>> That would be a useful enhancement even without my flagrant log
>     > >>>>> vandalism.
>     > >>>>>
>     > >>>>
>     > >>>> --
>     > >>>> Kevin A. McGrail
>     > >>>> Member, Apache Software Foundation
>     > >>>> Chair Emeritus Apache SpamAssassin Project
>     > >>>> [7]https://www.linkedin.com/in/kmcgrail - 703.798.0171
>     > >>>>
>     > >>>
>     > >>>> Index: lib/Mail/SpamAssassin/PerMsgStatus.pm
>     > >>>> ===================================================================
>     > >>>> --- lib/Mail/SpamAssassin/PerMsgStatus.pm       (revision 1860582)
>     > >>>> +++ lib/Mail/SpamAssassin/PerMsgStatus.pm       (working copy)
>     > >>>> @@ -769,7 +769,38 @@
>     > >>>>  sub get_names_of_subtests_hit {
>     > >>>>    my ($self) = @_;
>     > >>>>
>     > >>>> -  return join(',', sort @{$self->{subtest_names_hit}});
>     > >>>> +  #return join(',', sort @{$self->{subtest_names_hit}});
>     > >>>> +
>     > >>>> +  #This routine prints only one instance of a subrule hit with a
>     > >>>> count
>     > >>> of how many times it hit if greater than 1
>     > >>>> +  my (%subtest_names_hit, $i, $key, @keys, @sorted, $string, $rule,
>     > >>> $total_hits, $deduplicated_hits);
>     > >>>> +
>     > >>>> +  $total_hits = scalar(@{$self->{subtest_names_hit}});
>     > >>>> +
>     > >>>> +  for ($i=0; $i < $total_hits; $i++) {
>     > >>>> +    $rule = ${$self->{subtest_names_hit}}[$i];
>     > >>>> +    $subtest_names_hit{$rule}++;
>     > >>>> +  }
>     > >>>> +
>     > >>>> +  foreach $key (keys %subtest_names_hit) {
>     > >>>> +    push (@keys, $key);
>     > >>>> +  }
>     > >>>> +  @sorted = sort @keys;
>     > >>>> +
>     > >>>> +  $deduplicated_hits = scalar(@sorted);
>     > >>>> +
>     > >>>> +  for ($i=0; $i < $deduplicated_hits; $i++) {
>     > >>>> +    $string .= $sorted[$i];
>     > >>>> +    if ($subtest_names_hit{$sorted[$i]} > 1) {
>     > >>>> +      $string .= "($subtest_names_hit{$sorted[$i]})"
>     > >>>> +    }
>     > >>>> +    $string .= ",";
>     > >>>> +  }
>     > >>>> +
>     > >>>> +  $string =~ s/,$//;
>     > >>>> +
>     > >>>> +  $string .= " (Total Subtest Hits: $total_hits / Deduplicated
>     Total
>     > >>> Hits: $deduplicated_hits)";
>     > >>>> +
>     > >>>> +  return $string;
>     > >>>>  }
>     > >>>>
>     > >>>>
>     > >>> #####################################################################
>     ######
>     > >>>
>     > >>>
>     > >>>
>     > >>
>     > >
>     >
>     > --
>     > Kevin A. McGrail
>     > Member, Apache Software Foundation
>     > Chair Emeritus Apache SpamAssassin Project
>     > [8]https://www.linkedin.com/in/kmcgrail - 703.798.0171
>     >
> 
>     > Index: lib/Mail/SpamAssassin/PerMsgStatus.pm
>     > ===================================================================
>     > --- lib/Mail/SpamAssassin/PerMsgStatus.pm       (revision 1860582)
>     > +++ lib/Mail/SpamAssassin/PerMsgStatus.pm       (working copy)
>     > @@ -398,7 +398,7 @@
>     >    dbg("check: is spam? score=".$self->{score}.
>     >                          " required=".$self->{conf}->{required_score});
>     >    dbg("check: tests=".$self->get_names_of_tests_hit());
>     > -  dbg("check: subtests=".$self->get_names_of_subtests_hit());
>     > +  dbg("check: subtests=".$self->get_names_of_subtests_hit("dbg"));
>     >    $self->{is_spam} = $self->is_spam();
>     > 
>     >    $self->{main}->{resolver}->bgabort();
>     > @@ -764,12 +764,52 @@
>     >  normally-hidden rules, which score 0 and have names beginning with two
>     >  underscores, used in meta rules.
>     > 
>     > +If a parameter of dbg is passed, the output will be more condensed and
>     > +sub-tests with multiple hits reduced to one entry with the number of
>     hits
>     > +in parentheses. Some information is also added at the end regarding the
>     > +multiple hits.
>     > +
>     >  =cut
>     > 
>     >  sub get_names_of_subtests_hit {
>     > -  my ($self) = @_;
>     > +  my ($self, $mode) = @_;
>     > 
>     > -  return join(',', sort @{$self->{subtest_names_hit}});
>     > +  if (defined $mode && $mode eq 'dbg') {
>     > +    #This routine prints only one instance of a subrule hit with a count
>     of how many times it hit if greater than 1
>     > +    my (%subtest_names_hit, $i, $key, @keys, @sorted, $string, $rule,
>     $total_hits, $deduplicated_hits); 
>     > + 
>     > +    $total_hits = scalar(@{$self->{subtest_names_hit}});
>     > + 
>     > +    for ($i=0; $i < $total_hits; $i++) {
>     > +      $rule = ${$self->{subtest_names_hit}}[$i];
>     > +      $subtest_names_hit{$rule}++;
>     > +    }
>     > + 
>     > +    foreach $key (keys %subtest_names_hit) {
>     > +      push (@keys, $key);
>     > +    }
>     > +    @sorted = sort @keys;
>     > + 
>     > +    $deduplicated_hits = scalar(@sorted);
>     > + 
>     > +    for ($i=0; $i < $deduplicated_hits; $i++) {
>     > +      $string .= $sorted[$i];
>     > +      if ($subtest_names_hit{$sorted[$i]} > 1) {
>     > +        $string .= "($subtest_names_hit{$sorted[$i]})"
>     > +      }
>     > +      $string .= ",";
>     > +    }
>     > + 
>     > +    $string =~ s/,$//;
>     > + 
>     > +    $string .= " (Total Subtest Hits: $total_hits / Deduplicated Total
>     Hits: $deduplicated_hits)";
>     > + 
>     > +    return $string;
>     > +
>     > +  } else {
>     > +    #return the simpler string with duplicates and commas
>     > +    return join(',', sort @{$self->{subtest_names_hit}});
>     > +  }
>     >  }
>     > 
>     >  ########################################################################
>     ###
> 
> 
> 
> References:
> 
> [1] mailto:hege@hege.li
> [2] mailto:hege@hege.li
> [3] http://72_active.cf/
> [4] http://72_active.cf/
> [5] http://72_active.cf/
> [6] http://72_active.cf/
> [7] https://www.linkedin.com/in/kmcgrail
> [8] https://www.linkedin.com/in/kmcgrail
> [9] mailto:hege@hege.li
> [10] mailto:hege@hege.li
> [11] http://72_active.cf/
> [12] http://72_active.cf/
> [13] http://72_active.cf/
> [14] http://72_active.cf/
> [15] https://www.linkedin.com/in/kmcgrail
> [16] https://www.linkedin.com/in/kmcgrail

Re: __E_LIKE_LETTER & __LOWER_E filling subtests debug

Posted by "Kevin A. McGrail" <km...@apache.org>.
That is a frightening one liner.  Should we use it?

As for the more output comment, if you have emails with 300 lower case e's,
you get 300 hits for the subtest.  It is unreadable for rule analysis.

As for modifying the normal output, I have no idea if anyone out there is
using the public routine so better to be safe.

I didn't find a tag for subtests either. That might be a good 4.0 addition.

Regards, KAM
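
For comparison with the regex one-liner quoted below, here is a minimal
sketch of the same compaction using a plain hash count. The name
compact_names is hypothetical, not an existing SpamAssassin routine. It
takes a list of names, so the comma-delimited return format of
get_names_of_subtests_hit() stays untouched and callers that split on
commas would be unaffected.

```perl
use strict;
use warnings;

# Hypothetical helper for debug logging only: collapse repeated names
# into "name(count)", sorted and comma-joined. Names hit once get no count.
sub compact_names {
  my @names = @_;
  my %count;
  $count{$_}++ for @names;
  return join(',',
    map { $count{$_} > 1 ? "$_($count{$_})" : $_ } sort keys %count);
}

# Same sample data as the one-liner in this thread.
my @hits = split /,/, '__A,__B,__C,__C,__C,__CC,__D,__D,__E,__E';
print compact_names(@hits), "\n";
# prints: __A,__B,__C(3),__CC,__D(2),__E(2)
```

Unlike the regex, this does not depend on the input already being sorted,
and it avoids embedded code blocks in the pattern, at the cost of a few more
lines.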

On Thu, Jun 6, 2019, 01:30 Henrik K <he...@hege.li> wrote:

>
> Well in theory you see _more_ debug output now when there are no
> duplicates, due to the stats string..  honestly at least I wouldn't care
> about that.  Feel free to vote.
>
> As a silly morning exercise, here's a one-liner that compacts stuff :-P
>
> my $foo = '__A,__B,__C,__C,__C,__CC,__D,__D,__E,__E';
> my $m; $foo =~ s/([^,]+)(?{$m=1})(?:,\1(?=,|$)(?{$m++}))+/"$1($m)"/eg;
>
> __A,__B,__C(3),__CC,__D(2),__E(2)
>
>
> On Wed, Jun 05, 2019 at 08:25:00PM -0400, Kevin A. McGrail wrote:
> > Good point, Henrik & John.
> >
> > OK, I've left the output alone except for the calls from dbg so it
> > shouldn't break anything in the public interface.
> >
> > Thoughts on this version?
> >
> > Regards,
> > KAM
> >
> > On 6/4/2019 1:51 PM, John Hardin wrote:
> > > On Tue, 4 Jun 2019, Kevin A. McGrail wrote:
> > >
> > >> Yes, I was thinking about that and wanting to fix uritests so well
> > >> for the
> > >> template.   Thanks for the feedback.  I will take another pass at it.
> > >
> > > Just do the deduplication without modifying the output format.
> > >
> > > If we want to log the hit counts, then make another function that does
> > > what you did and use it for logging.
> > >
> > >
> > >> On Tue, Jun 4, 2019, 03:23 Henrik K <he...@hege.li> wrote:
> > >>
> > >>>
> > >>> If you want to modify debug output, you have to modify only the
> > >>> dbg() output itself.  You can't modify internal functions that have
> > >>> specific output formats and start adding random strings to them.
> > >>> At least these places depend on the comma delimited rules:
> > >>>
> > >>> ./masses/mass-check:    push @tests, split(/,/,
> > >>> $status->get_names_of_subtests_hit());
> > >>> ./t/rule_tests.t:    my %rules_hit = map { $_ => 1 }
> > >>> split(/,/,$msg->get_names_of_tests_hit()),
> > >>> split(/,/,$msg->get_names_of_subtests_hit());
> > >>> ./t.rules/run:  my $testsline =
> > >>> $status->get_names_of_tests_hit().",".$status->get_names_of_subtests_hit();
> > >>>
> > >>>
> > >>>
> > >>>
> > >>> On Tue, Jun 04, 2019 at 01:56:26AM -0400, Kevin A. McGrail wrote:
> > >>>> Morning All,
> > >>>>
> > >>>> After a few thoughts on limits, it appears that any duplicate
> > >>>> subtest hits are best combined for debug output.
> > >>>>
> > >>>> Any thoughts on the attached?  It looks like it will help me with
> > >>>> rule development while supporting rules with valid but large
> > >>>> maxhits like __LOWER_E
> > >>>>
> > >>>> Regards,
> > >>>> KAM
> > >>>>
> > >>>> On 5/31/2019 10:30 AM, Bill Cole wrote:
> > >>>>> On 30 May 2019, at 20:35, Kevin A. McGrail wrote:
> > >>>>>
> > >>>>>> I was curious if anyone noticed the debug output for subtests has
> > >>> gotten
> > >>>>>> insane:
> > >>>>>
> > >>>>> It got a little discussion on users@ when I created those rules.
> > >>>>>
> > >>>>> [...]
> > >>>>>
> > >>>>>> 72_active.cf:    body            __LOWER_E       /e/
> > >>>>>> 72_active.cf:    tflags          __LOWER_E       multiple
> > >>>>>> maxhits=230
> > >>>>>>
> > >>>>>> 72_active.cf:    body            __E_LIKE_LETTER /<lcase_e>/
> > >>>>>> 72_active.cf:    tflags          __E_LIKE_LETTER multiple
> > >>>>>> maxhits=320
> > >>>>>>
> > >>>>>> Assuming those maxhits are correct,
> > >>>>>
> > >>>>> They are. In fact they were carefully tuned to catch the targeted
> > >>>>> extortion spam.
> > >>>>>
> > >>>>>> maybe we need something in the debug
> > >>>>>> output that says __E_LIKE_LETTER (number of hits if more than 1).
> > >>>>>
> > >>>>> That would be a useful enhancement even without my flagrant log
> > >>>>> vandalism.
> > >>>>>
> > >>>>
> > >>>> --
> > >>>> Kevin A. McGrail
> > >>>> Member, Apache Software Foundation
> > >>>> Chair Emeritus Apache SpamAssassin Project
> > >>>> https://www.linkedin.com/in/kmcgrail - 703.798.0171
> > >>>>
> > >>>
> > >>>> Index: lib/Mail/SpamAssassin/PerMsgStatus.pm
> > >>>> ===================================================================
> > >>>> --- lib/Mail/SpamAssassin/PerMsgStatus.pm       (revision 1860582)
> > >>>> +++ lib/Mail/SpamAssassin/PerMsgStatus.pm       (working copy)
> > >>>> @@ -769,7 +769,38 @@
> > >>>>  sub get_names_of_subtests_hit {
> > >>>>    my ($self) = @_;
> > >>>>
> > >>>> -  return join(',', sort @{$self->{subtest_names_hit}});
> > >>>> +  #return join(',', sort @{$self->{subtest_names_hit}});
> > >>>> +
> > >>>> +  #This routine prints only one instance of a subrule hit with a
> > >>>> count
> > >>> of how many times it hit if greater than 1
> > >>>> +  my (%subtest_names_hit, $i, $key, @keys, @sorted, $string, $rule,
> > >>> $total_hits, $deduplicated_hits);
> > >>>> +
> > >>>> +  $total_hits = scalar(@{$self->{subtest_names_hit}});
> > >>>> +
> > >>>> +  for ($i=0; $i < $total_hits; $i++) {
> > >>>> +    $rule = ${$self->{subtest_names_hit}}[$i];
> > >>>> +    $subtest_names_hit{$rule}++;
> > >>>> +  }
> > >>>> +
> > >>>> +  foreach $key (keys %subtest_names_hit) {
> > >>>> +    push (@keys, $key);
> > >>>> +  }
> > >>>> +  @sorted = sort @keys;
> > >>>> +
> > >>>> +  $deduplicated_hits = scalar(@sorted);
> > >>>> +
> > >>>> +  for ($i=0; $i < $deduplicated_hits; $i++) {
> > >>>> +    $string .= $sorted[$i];
> > >>>> +    if ($subtest_names_hit{$sorted[$i]} > 1) {
> > >>>> +      $string .= "($subtest_names_hit{$sorted[$i]})"
> > >>>> +    }
> > >>>> +    $string .= ",";
> > >>>> +  }
> > >>>> +
> > >>>> +  $string =~ s/,$//;
> > >>>> +
> > >>>> +  $string .= " (Total Subtest Hits: $total_hits / Deduplicated
> Total
> > >>> Hits: $deduplicated_hits)";
> > >>>> +
> > >>>> +  return $string;
> > >>>>  }
> > >>>>
> > >>>>
> > >>>
> ###########################################################################
> > >>>
> > >>>
> > >>>
> > >>
> > >
> >
> > --
> > Kevin A. McGrail
> > Member, Apache Software Foundation
> > Chair Emeritus Apache SpamAssassin Project
> > https://www.linkedin.com/in/kmcgrail - 703.798.0171
> >
>
> > Index: lib/Mail/SpamAssassin/PerMsgStatus.pm
> > ===================================================================
> > --- lib/Mail/SpamAssassin/PerMsgStatus.pm       (revision 1860582)
> > +++ lib/Mail/SpamAssassin/PerMsgStatus.pm       (working copy)
> > @@ -398,7 +398,7 @@
> >    dbg("check: is spam? score=".$self->{score}.
> >                          " required=".$self->{conf}->{required_score});
> >    dbg("check: tests=".$self->get_names_of_tests_hit());
> > -  dbg("check: subtests=".$self->get_names_of_subtests_hit());
> > +  dbg("check: subtests=".$self->get_names_of_subtests_hit("dbg"));
> >    $self->{is_spam} = $self->is_spam();
> >
> >    $self->{main}->{resolver}->bgabort();
> > @@ -764,12 +764,52 @@
> >  normally-hidden rules, which score 0 and have names beginning with two
> >  underscores, used in meta rules.
> >
> > +If a parameter of dbg is passed, the output will be more condensed and
> > +sub-tests with multiple hits reduced to one entry with the number of
> hits
> > +in parentheses. Some information is also added at the end regarding the
> > +multiple hits.
> > +
> >  =cut
> >
> >  sub get_names_of_subtests_hit {
> > -  my ($self) = @_;
> > +  my ($self, $mode) = @_;
> >
> > -  return join(',', sort @{$self->{subtest_names_hit}});
> > +  if (defined $mode && $mode eq 'dbg') {
> > +    #This routine prints only one instance of a subrule hit with a
> count of how many times it hit if greater than 1
> > +    my (%subtest_names_hit, $i, $key, @keys, @sorted, $string, $rule,
> $total_hits, $deduplicated_hits);
> > +
> > +    $total_hits = scalar(@{$self->{subtest_names_hit}});
> > +
> > +    for ($i=0; $i < $total_hits; $i++) {
> > +      $rule = ${$self->{subtest_names_hit}}[$i];
> > +      $subtest_names_hit{$rule}++;
> > +    }
> > +
> > +    foreach $key (keys %subtest_names_hit) {
> > +      push (@keys, $key);
> > +    }
> > +    @sorted = sort @keys;
> > +
> > +    $deduplicated_hits = scalar(@sorted);
> > +
> > +    for ($i=0; $i < $deduplicated_hits; $i++) {
> > +      $string .= $sorted[$i];
> > +      if ($subtest_names_hit{$sorted[$i]} > 1) {
> > +        $string .= "($subtest_names_hit{$sorted[$i]})"
> > +      }
> > +      $string .= ",";
> > +    }
> > +
> > +    $string =~ s/,$//;
> > +
> > +    $string .= " (Total Subtest Hits: $total_hits / Deduplicated Total
> Hits: $deduplicated_hits)";
> > +
> > +    return $string;
> > +
> > +  } else {
> > +    #return the simpler string with duplicates and commas
> > +    return join(',', sort @{$self->{subtest_names_hit}});
> > +  }
> >  }
> >
> >
> ###########################################################################
>
>
On Thu, Jun 6, 2019, 01:30 Henrik K <he...@hege.li> wrote:

>
> Well in theory you see _more_ debug output now when there are no
> duplicates, due to the stats string..  honestly at least I wouldn't care
> about that.  Feel free to vote.
>
> As a silly morning exercise, here's a one-liner that compacts stuff :-P
>
> my $foo = '__A,__B,__C,__C,__C,__CC,__D,__D,__E,__E';
> my $m; $foo =~ s/([^,]+)(?{$m=1})(?:,\1(?=,|$)(?{$m++}))+/"$1($m)"/eg;
>
> __A,__B,__C(3),__CC,__D(2),__E(2)
>
>
> On Wed, Jun 05, 2019 at 08:25:00PM -0400, Kevin A. McGrail wrote:
> > Good point, Henrik & John.
> >
> > OK, I've left the output alone except for the calls from dbg so it
> > shouldn't break anything in the public interface.
> >
> > Thoughts on this version?
> >
> > Regards,
> > KAM
> >
> > On 6/4/2019 1:51 PM, John Hardin wrote:
> > > On Tue, 4 Jun 2019, Kevin A. McGrail wrote:
> > >
> > >> Yes, I was thinking about that and wanting to fix uritests so well
> > >> for the
> > >> template.   Thanks for the feedback.  I will take another pass at it.
> > >
> > > Just do the deduplication without modifying the output format.
> > >
> > > If we want to log the hit counts, then make another function that does
> > > what you did and use it for logging.
> > >
> > >
> > >> On Tue, Jun 4, 2019, 03:23 Henrik K <he...@hege.li> wrote:
> > >>
> > >>>
> > >>> If you want to modify debug output, you have to modify only the
> > >>> dbg() output itself.  You can't modify internal functions that have
> > >>> specific output formats and start adding random strings to them.
> > >>> At least these places depend on the comma delimited rules:
> > >>>
> > >>> ./masses/mass-check:    push @tests, split(/,/,
> > >>> $status->get_names_of_subtests_hit());
> > >>> ./t/rule_tests.t:    my %rules_hit = map { $_ => 1 }
> > >>> split(/,/,$msg->get_names_of_tests_hit()),
> > >>> split(/,/,$msg->get_names_of_subtests_hit());
> > >>> ./t.rules/run:  my $testsline =
> > >>> $status->get_names_of_tests_hit().",".$status->get_names_of_subtests_hit();
> > >>>
> > >>>
> > >>>
> > >>>
> > >>> On Tue, Jun 04, 2019 at 01:56:26AM -0400, Kevin A. McGrail wrote:
> > >>>> Morning All,
> > >>>>
> > >>>> After a few thoughts on limits, it appears that any duplicate
> > >>>> subtest hits are best combined for debug output.
> > >>>>
> > >>>> Any thoughts on the attached?  It looks like it will help me with
> > >>>> rule development while supporting rules with valid but large
> > >>>> maxhits like __LOWER_E
> > >>>>
> > >>>> Regards,
> > >>>> KAM
> > >>>>
> > >>>> On 5/31/2019 10:30 AM, Bill Cole wrote:
> > >>>>> On 30 May 2019, at 20:35, Kevin A. McGrail wrote:
> > >>>>>
> > >>>>>> I was curious if anyone noticed the debug output for subtests has
> > >>> gotten
> > >>>>>> insane:
> > >>>>>
> > >>>>> It got a little discussion on users@ when I created those rules.
> > >>>>>
> > >>>>> [...]
> > >>>>>
> > >>>>>> 72_active.cf:    body            __LOWER_E       /e/
> > >>>>>> 72_active.cf:    tflags          __LOWER_E       multiple
> > >>>>>> maxhits=230
> > >>>>>>
> > >>>>>> 72_active.cf:    body            __E_LIKE_LETTER /<lcase_e>/
> > >>>>>> 72_active.cf:    tflags          __E_LIKE_LETTER multiple
> > >>>>>> maxhits=320
> > >>>>>>
> > >>>>>> Assuming those maxhits are correct,
> > >>>>>
> > >>>>> They are. In fact they were carefully tuned to catch the targeted
> > >>>>> extortion spam.
> > >>>>>
> > >>>>>> maybe we need something in the debug
> > >>>>>> output that says __E_LIKE_LETTER (number of hits if more than 1).
> > >>>>>
> > >>>>> That would be a useful enhancement even without my flagrant log
> > >>>>> vandalism.
> > >>>>>
> > >>>>
> > >>>> --
> > >>>> Kevin A. McGrail
> > >>>> Member, Apache Software Foundation
> > >>>> Chair Emeritus Apache SpamAssassin Project
> > >>>> https://www.linkedin.com/in/kmcgrail - 703.798.0171
> > >>>>
> > >>>
> > >>>> Index: lib/Mail/SpamAssassin/PerMsgStatus.pm
> > >>>> ===================================================================
> > >>>> --- lib/Mail/SpamAssassin/PerMsgStatus.pm       (revision 1860582)
> > >>>> +++ lib/Mail/SpamAssassin/PerMsgStatus.pm       (working copy)
> > >>>> @@ -769,7 +769,38 @@
> > >>>>  sub get_names_of_subtests_hit {
> > >>>>    my ($self) = @_;
> > >>>>
> > >>>> -  return join(',', sort @{$self->{subtest_names_hit}});
> > >>>> +  #return join(',', sort @{$self->{subtest_names_hit}});
> > >>>> +
> > >>>> +  #This routine prints only one instance of a subrule hit with a
> > >>>> count
> > >>> of how many times it hit if greater than 1
> > >>>> +  my (%subtest_names_hit, $i, $key, @keys, @sorted, $string, $rule,
> > >>> $total_hits, $deduplicated_hits);
> > >>>> +
> > >>>> +  $total_hits = scalar(@{$self->{subtest_names_hit}});
> > >>>> +
> > >>>> +  for ($i=0; $i < $total_hits; $i++) {
> > >>>> +    $rule = ${$self->{subtest_names_hit}}[$i];
> > >>>> +    $subtest_names_hit{$rule}++;
> > >>>> +  }
> > >>>> +
> > >>>> +  foreach $key (keys %subtest_names_hit) {
> > >>>> +    push (@keys, $key);
> > >>>> +  }
> > >>>> +  @sorted = sort @keys;
> > >>>> +
> > >>>> +  $deduplicated_hits = scalar(@sorted);
> > >>>> +
> > >>>> +  for ($i=0; $i < $deduplicated_hits; $i++) {
> > >>>> +    $string .= $sorted[$i];
> > >>>> +    if ($subtest_names_hit{$sorted[$i]} > 1) {
> > >>>> +      $string .= "($subtest_names_hit{$sorted[$i]})"
> > >>>> +    }
> > >>>> +    $string .= ",";
> > >>>> +  }
> > >>>> +
> > >>>> +  $string =~ s/,$//;
> > >>>> +
> > >>>> +  $string .= " (Total Subtest Hits: $total_hits / Deduplicated
> Total
> > >>> Hits: $deduplicated_hits)";
> > >>>> +
> > >>>> +  return $string;
> > >>>>  }
> > >>>>
> > >>>>
> > >>>
> ###########################################################################
> > >>>
> > >>>
> > >>>
> > >>
> > >
> >
> > --
> > Kevin A. McGrail
> > Member, Apache Software Foundation
> > Chair Emeritus Apache SpamAssassin Project
> > https://www.linkedin.com/in/kmcgrail - 703.798.0171
> >
>
> > Index: lib/Mail/SpamAssassin/PerMsgStatus.pm
> > ===================================================================
> > --- lib/Mail/SpamAssassin/PerMsgStatus.pm       (revision 1860582)
> > +++ lib/Mail/SpamAssassin/PerMsgStatus.pm       (working copy)
> > @@ -398,7 +398,7 @@
> >    dbg("check: is spam? score=".$self->{score}.
> >                          " required=".$self->{conf}->{required_score});
> >    dbg("check: tests=".$self->get_names_of_tests_hit());
> > -  dbg("check: subtests=".$self->get_names_of_subtests_hit());
> > +  dbg("check: subtests=".$self->get_names_of_subtests_hit("dbg"));
> >    $self->{is_spam} = $self->is_spam();
> >
> >    $self->{main}->{resolver}->bgabort();
> > @@ -764,12 +764,52 @@
> >  normally-hidden rules, which score 0 and have names beginning with two
> >  underscores, used in meta rules.
> >
> > +If a parameter of dbg is passed, the output will be more condensed and
> > +sub-tests with multiple hits reduced to one entry with the number of
> hits
> > +in parentheses. Some information is also added at the end regarding the
> > +multiple hits.
> > +
> >  =cut
> >
> >  sub get_names_of_subtests_hit {
> > -  my ($self) = @_;
> > +  my ($self, $mode) = @_;
> >
> > -  return join(',', sort @{$self->{subtest_names_hit}});
> > +  if (defined $mode && $mode eq 'dbg') {
> > +    #This routine prints only one instance of a subrule hit with a
> count of how many times it hit if greater than 1
> > +    my (%subtest_names_hit, $i, $key, @keys, @sorted, $string, $rule,
> $total_hits, $deduplicated_hits);
> > +
> > +    $total_hits = scalar(@{$self->{subtest_names_hit}});
> > +
> > +    for ($i=0; $i < $total_hits; $i++) {
> > +      $rule = ${$self->{subtest_names_hit}}[$i];
> > +      $subtest_names_hit{$rule}++;
> > +    }
> > +
> > +    foreach $key (keys %subtest_names_hit) {
> > +      push (@keys, $key);
> > +    }
> > +    @sorted = sort @keys;
> > +
> > +    $deduplicated_hits = scalar(@sorted);
> > +
> > +    for ($i=0; $i < $deduplicated_hits; $i++) {
> > +      $string .= $sorted[$i];
> > +      if ($subtest_names_hit{$sorted[$i]} > 1) {
> > +        $string .= "($subtest_names_hit{$sorted[$i]})"
> > +      }
> > +      $string .= ",";
> > +    }
> > +
> > +    $string =~ s/,$//;
> > +
> > +    $string .= " (Total Subtest Hits: $total_hits / Deduplicated Total
> Hits: $deduplicated_hits)";
> > +
> > +    return $string;
> > +
> > +  } else {
> > +    #return the simpler string with duplicates and commas
> > +    return join(',', sort @{$self->{subtest_names_hit}});
> > +  }
> >  }
> >
> >
> ###########################################################################
>
>

Re: __E_LIKE_LETTER & __LOWER_E filling subtests debug

Posted by Henrik K <he...@hege.li>.
Well, in theory you now see _more_ debug output when there are no duplicates,
due to the stats string. Honestly, I at least wouldn't care about that.
Feel free to vote.

As a silly morning exercise, here's a one-liner that compacts stuff :-P

my $foo = '__A,__B,__C,__C,__C,__CC,__D,__D,__E,__E';
my $m; $foo =~ s/([^,]+)(?{$m=1})(?:,\1(?=,|$)(?{$m++}))+/"$1($m)"/eg;

__A,__B,__C(3),__CC,__D(2),__E(2)
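
For readers who find the embedded `(?{ ... })` code blocks hard to follow, the same compaction can be sketched as a plain loop over the sorted names. This is only an illustrative equivalent, not code from the thread or from SpamAssassin; the function name compact_sorted_names is made up for the example, and it assumes the input is already sorted so that duplicates are adjacent.

```perl
use strict;
use warnings;

# Collapse runs of identical names in an already-sorted list into
# "name(count)", leaving single hits as the bare name.
# (compact_sorted_names is a hypothetical helper name.)
sub compact_sorted_names {
  my (@names) = @_;
  my @out;
  my ($prev, $count);
  for my $name (@names) {
    if (defined $prev && $name eq $prev) {
      $count++;
      next;
    }
    # A new name starts: flush the previous run, if there was one.
    push @out, ($count > 1 ? "$prev($count)" : $prev) if defined $prev;
    ($prev, $count) = ($name, 1);
  }
  # Flush the final run.
  push @out, ($count > 1 ? "$prev($count)" : $prev) if defined $prev;
  return join(',', @out);
}

my $foo = '__A,__B,__C,__C,__C,__CC,__D,__D,__E,__E';
my $compact = compact_sorted_names(split(/,/, $foo));
print "$compact\n";   # prints __A,__B,__C(3),__CC,__D(2),__E(2)
```

The loop compares whole names, so __C and __CC stay distinct, which is what the `(?=,|$)` lookahead guards against in the regex version.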


On Wed, Jun 05, 2019 at 08:25:00PM -0400, Kevin A. McGrail wrote:
> Good point, Henrik & John.
> 
> OK, I've left the output alone except for the calls from dbg so it
> shouldn't break anything in the public interface.
> 
> Thoughts on this version?
> 
> Regards,
> KAM
> 

> Index: lib/Mail/SpamAssassin/PerMsgStatus.pm
> ===================================================================
> --- lib/Mail/SpamAssassin/PerMsgStatus.pm       (revision 1860582)
> +++ lib/Mail/SpamAssassin/PerMsgStatus.pm       (working copy)
> @@ -398,7 +398,7 @@
>    dbg("check: is spam? score=".$self->{score}.
>                          " required=".$self->{conf}->{required_score});
>    dbg("check: tests=".$self->get_names_of_tests_hit());
> -  dbg("check: subtests=".$self->get_names_of_subtests_hit());
> +  dbg("check: subtests=".$self->get_names_of_subtests_hit("dbg"));
>    $self->{is_spam} = $self->is_spam();
>  
>    $self->{main}->{resolver}->bgabort();
> @@ -764,12 +764,52 @@
>  normally-hidden rules, which score 0 and have names beginning with two
>  underscores, used in meta rules.
>  
> +If a parameter of dbg is passed, the output will be more condensed and 
> +sub-tests with multiple hits reduced to one entry with the number of hits 
> +in parentheses. Some information is also added at the end regarding the 
> +multiple hits.
> +
>  =cut
>  
>  sub get_names_of_subtests_hit {
> -  my ($self) = @_;
> +  my ($self, $mode) = @_;
>  
> -  return join(',', sort @{$self->{subtest_names_hit}});
> +  if (defined $mode && $mode eq 'dbg') {
> +    #This routine prints only one instance of a subrule hit with a count of how many times it hit if greater than 1
> +    my (%subtest_names_hit, $i, $key, @keys, @sorted, $string, $rule, $total_hits, $deduplicated_hits);  
> +  
> +    $total_hits = scalar(@{$self->{subtest_names_hit}});
> +  
> +    for ($i=0; $i < $total_hits; $i++) {
> +      $rule = ${$self->{subtest_names_hit}}[$i]; 
> +      $subtest_names_hit{$rule}++; 
> +    }
> +  
> +    foreach $key (keys %subtest_names_hit) {
> +      push (@keys, $key);
> +    }
> +    @sorted = sort @keys;
> +  
> +    $deduplicated_hits = scalar(@sorted);
> +  
> +    for ($i=0; $i < $deduplicated_hits; $i++) {
> +      $string .= $sorted[$i];
> +      if ($subtest_names_hit{$sorted[$i]} > 1) {
> +        $string .= "($subtest_names_hit{$sorted[$i]})"
> +      }
> +      $string .= ",";
> +    }
> +  
> +    $string =~ s/,$//;
> +  
> +    $string .= " (Total Subtest Hits: $total_hits / Deduplicated Total Hits: $deduplicated_hits)";
> +  
> +    return $string;
> +
> +  } else {
> +    #return the simpler string with duplicates and commas
> +    return join(',', sort @{$self->{subtest_names_hit}});
> +  }
>  }
>  
>  ###########################################################################


Re: __E_LIKE_LETTER & __LOWER_E filling subtests debug

Posted by "Kevin A. McGrail" <km...@apache.org>.
Good point, Henrik & John.

OK, I've left the output alone except for the calls from dbg so it
shouldn't break anything in the public interface.

Thoughts on this version?

Regards,
KAM

-- 
Kevin A. McGrail
Member, Apache Software Foundation
Chair Emeritus Apache SpamAssassin Project
https://www.linkedin.com/in/kmcgrail - 703.798.0171


Re: __E_LIKE_LETTER & __LOWER_E filling subtests debug

Posted by John Hardin <jh...@impsec.org>.
On Tue, 4 Jun 2019, Kevin A. McGrail wrote:

> Yes, I was thinking about that and wanting to fix uritests so well for the
> template.   Thanks for the feedback.  I will take another pass at it.

Just do the deduplication without modifying the output format.

If we want to log the hit counts, then make another function that does 
what you did and use it for logging.
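
A minimal sketch of that split, assuming a stand-in class rather than the real Mail::SpamAssassin::PerMsgStatus: the existing getter keeps its comma-delimited contract, and a separate helper builds the condensed string for logging only. The package Demo::Status and the method name get_subtest_hit_summary are hypothetical, chosen just for this illustration.

```perl
use strict;
use warnings;

package Demo::Status;   # hypothetical stand-in, not the real PerMsgStatus

sub new {
  my ($class, @hits) = @_;
  return bless { subtest_names_hit => [@hits] }, $class;
}

# Unchanged contract: plain comma-delimited names, duplicates included.
sub get_names_of_subtests_hit {
  my ($self) = @_;
  return join(',', sort @{ $self->{subtest_names_hit} });
}

# Hypothetical logging-only helper: one entry per rule, with the hit
# count appended in parentheses when the rule hit more than once.
sub get_subtest_hit_summary {
  my ($self) = @_;
  my %count;
  $count{$_}++ for @{ $self->{subtest_names_hit} };
  return join(',',
    map { $count{$_} > 1 ? "$_($count{$_})" : $_ } sort keys %count);
}

package main;

my $status = Demo::Status->new(qw(__LOWER_E __CT __LOWER_E __LOWER_E));
print "machine: ", $status->get_names_of_subtests_hit(), "\n";
print "debug:   ", $status->get_subtest_hit_summary(), "\n";
```

Callers that split the getter's output on commas keep working, and only the debug line would need to call the new helper.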


-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
   ...we no longer live in a nation of laws. If we did,
   [Hillary Clinton] wouldn't be running for office,
   she'd be running for Mexico.                        -- Bill Whittle
   (or somewhere that does not have an extradition treaty with the US)
-----------------------------------------------------------------------
  2 days until the 75th anniversary of D-Day

Re: __E_LIKE_LETTER & __LOWER_E filling subtests debug

Posted by "Kevin A. McGrail" <km...@apache.org>.
Yes, I was thinking about that and wanting to fix uritests as well for the
template.  Thanks for the feedback.  I will take another pass at it.

Re: __E_LIKE_LETTER & __LOWER_E filling subtests debug

Posted by Henrik K <he...@hege.li>.
If you want to modify debug output, you have to modify only the dbg() output
itself.  You can't modify internal functions that have specific output
formats and start adding random strings to them.  At least these places
depend on the comma-delimited rule names:

./masses/mass-check:    push @tests, split(/,/, $status->get_names_of_subtests_hit());
./t/rule_tests.t:    my %rules_hit = map { $_ => 1 } split(/,/,$msg->get_names_of_tests_hit()), split(/,/,$msg->get_names_of_subtests_hit());
./t.rules/run:  my $testsline = $status->get_names_of_tests_hit().",".$status->get_names_of_subtests_hit();
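
To make the breakage concrete, here is a small sketch of the split-and-map idiom used by those callers, run once on the plain format and once on a string with counts and the stats suffix appended. The two sample strings are invented for the illustration; only the suffix wording is taken from the patch.

```perl
use strict;
use warnings;

# What a caller receives with the plain format, versus a string with
# hit counts and the trailing stats text added (sample values only).
my $plain    = '__CT,__LOWER_E,__LOWER_E';
my $modified = '__CT,__LOWER_E(2) (Total Subtest Hits: 3 / Deduplicated Total Hits: 2)';

# Build a lookup of rule names by splitting on commas.
my %hit_plain    = map { ($_ => 1) } split(/,/, $plain);
my %hit_modified = map { ($_ => 1) } split(/,/, $modified);

print "plain:    ", (exists $hit_plain{'__LOWER_E'}    ? "found" : "missing"), "\n";
print "modified: ", (exists $hit_modified{'__LOWER_E'} ? "found" : "missing"), "\n";
# prints "plain:    found" and "modified: missing"
```

With the modified string, the key becomes the rule name plus the count and the stats text, so an exact lookup of __LOWER_E no longer succeeds.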



On Tue, Jun 04, 2019 at 01:56:26AM -0400, Kevin A. McGrail wrote:
> Morning All,
> 
> After a few thoughts on limits, it appears that any duplicate subtest
> hits are best combined for debug output.
> 
> Any thoughts on the attached?  It looks like it will help me with rule
> development while support rules with valid but large maxhits like __LOWER_E
> 
> Regards,
> KAM
> 

> Index: lib/Mail/SpamAssassin/PerMsgStatus.pm
> ===================================================================
> --- lib/Mail/SpamAssassin/PerMsgStatus.pm       (revision 1860582)
> +++ lib/Mail/SpamAssassin/PerMsgStatus.pm       (working copy)
> @@ -769,7 +769,38 @@
>  sub get_names_of_subtests_hit {
>    my ($self) = @_;
>  
> -  return join(',', sort @{$self->{subtest_names_hit}});
> +  #return join(',', sort @{$self->{subtest_names_hit}});
> +
> +  #This routine prints only one instance of a subrule hit with a count of how many times it hit if greater than 1
> +  my (%subtest_names_hit, $i, $key, @keys, @sorted, $string, $rule, $total_hits, $deduplicated_hits);  
> +
> +  $total_hits = scalar(@{$self->{subtest_names_hit}});
> +
> +  for ($i=0; $i < $total_hits; $i++) {
> +    $rule = ${$self->{subtest_names_hit}}[$i]; 
> +    $subtest_names_hit{$rule}++; 
> +  }
> +
> +  foreach $key (keys %subtest_names_hit) {
> +    push (@keys, $key);
> +  }
> +  @sorted = sort @keys;
> +
> +  $deduplicated_hits = scalar(@sorted);
> +
> +  for ($i=0; $i < $deduplicated_hits; $i++) {
> +    $string .= $sorted[$i];
> +    if ($subtest_names_hit{$sorted[$i]} > 1) {
> +      $string .= "($subtest_names_hit{$sorted[$i]})"
> +    }
> +    $string .= ",";
> +  }
> +
> +  $string =~ s/,$//;
> +
> +  $string .= " (Total Subtest Hits: $total_hits / Deduplicated Total Hits: $deduplicated_hits)";
> +
> +  return $string;
>  }
>  
>  ###########################################################################


Re: __E_LIKE_LETTER & __LOWER_E filling subtests debug

Posted by "Kevin A. McGrail" <km...@apache.org>.
Morning All,

After a few thoughts on limits, it appears that any duplicate subtest
hits are best combined for debug output.

Any thoughts on the attached?  It looks like it will help me with rule
development while supporting rules with valid but large maxhits like __LOWER_E.

Regards,
KAM

On 5/31/2019 10:30 AM, Bill Cole wrote:
> On 30 May 2019, at 20:35, Kevin A. McGrail wrote:
>
>> I was curious if anyone noticed the debug output for subtests has gotten
>> insane:
>
> It got a little discussion on users@ when I created those rules.
>
> [...]
>
>> 72_active.cf:    body            __LOWER_E       /e/
>> 72_active.cf:    tflags          __LOWER_E       multiple maxhits=230
>>
>> 72_active.cf:    body            __E_LIKE_LETTER /<lcase_e>/
>> 72_active.cf:    tflags          __E_LIKE_LETTER multiple maxhits=320
>>
>> Assuming those maxhits are correct,
>
> They are. In fact they were carefully tuned to catch the targeted
> extortion spam.
>
>> maybe we need something in the debug
>> output that says __E_LIKE_LETTER (number of hits if more than 1).
>
> That would be a useful enhancement even without my flagrant log
> vandalism.
>

-- 
Kevin A. McGrail
Member, Apache Software Foundation
Chair Emeritus Apache SpamAssassin Project
https://www.linkedin.com/in/kmcgrail - 703.798.0171


Re: __E_LIKE_LETTER & __LOWER_E filling subtests debug

Posted by Bill Cole <sa...@billmail.scconsult.com>.
On 30 May 2019, at 20:35, Kevin A. McGrail wrote:

> I was curious if anyone noticed the debug output for subtests has gotten
> insane:

It got a little discussion on users@ when I created those rules.

[...]

> 72_active.cf:    body            __LOWER_E       /e/
> 72_active.cf:    tflags          __LOWER_E       multiple maxhits=230
>
> 72_active.cf:    body            __E_LIKE_LETTER /<lcase_e>/
> 72_active.cf:    tflags          __E_LIKE_LETTER multiple maxhits=320
>
> Assuming those maxhits are correct,

They are. In fact they were carefully tuned to catch the targeted 
extortion spam.
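
A rough sketch of why such rules flood the subtests line: with `tflags multiple maxhits=N`, a body rule can record one hit per regex match until the cap is reached, and each hit adds the rule name to the list again. The code below is not SpamAssassin's rule evaluation; `count_hits` is a hypothetical function, the sample text is made up, and the cap is assumed to be a positive integer.

```perl
use strict;
use warnings;

# Hypothetical illustration: count matches of $re in $text,
# stopping once $maxhits matches have been seen.
sub count_hits {
  my ($text, $re, $maxhits) = @_;
  my $hits = 0;
  while ($text =~ /$re/g) {
    last if ++$hits >= $maxhits;   # stop at the cap
  }
  return $hits;
}

my $body = "never open the attachment";     # made-up sample line
my $hits = count_hits($body, qr/e/, 230);   # cap taken from the quoted __LOWER_E rule
print "__LOWER_E would be listed $hits time(s)\n";
```

On a long message body the count reaches the cap, which is how a single rule name ends up repeated hundreds of times in the undeduplicated debug output.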

> maybe we need something in the debug
> output that says __E_LIKE_LETTER (number of hits if more than 1).

That would be a useful enhancement even without my flagrant log 
vandalism.

-- 
Bill Cole
bill@scconsult.com or billcole@apache.org
(AKA @grumpybozo and many *@billmail.scconsult.com addresses)
Not Currently Available For Hire