Posted to users@spamassassin.apache.org by Matus UHLAR - fantomas <uh...@fantomas.sk> on 2019/12/06 09:23:15 UTC

SA memory (Re: ".*" in body rules)

>On Thu, 5 Dec 2019 17:07:05 +0100
>Matus UHLAR - fantomas wrote:
>> seems some big mails were too long to scan, and SA even got killed.
>>
>> [2146809.213586] Out of memory: Kill process 3660 (spamassassin)
>> score 365 or sacrifice child [2146809.213613] Killed process 3660
>> (spamassassin) total-vm:2960664kB, anon-rss:2921892kB, file-rss:0kB,
>> shmem-rss:0kB [2146809.270342] oom_reaper: reaped process 3660
>> (spamassassin), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
>>
>> I see the mail body contains nearly 20MB uuencoded text (don't ask).
>>
>> I found some body rules that contain ".*" instead of a sane
>> quantifier:
>>
>> 72_active.cf:rawbody            __HAS_HREF      /^[^>].*?<a href=/im
>> 72_active.cf:rawbody            __HAS_HREF_ONECASE      /^[^>].*?<(a
>> href|A HREF)=/m 72_active.cf:rawbody            __HAS_IMG_SRC
>> /^[^>].*?<img src=/im 72_active.cf:rawbody  __HAS_IMG_SRC_DATA
>> /^[^>].*?<img src=['"]data/im 72_active.cf:rawbody
>> __HAS_IMG_SRC_ONECASE   /^[^>].*?<(img src|IMG SRC)=/m
>>
>> There are different checks that have the "*" quantifier tho.
>> Is it reasonable to replace them with {0,1000} globally?

On 05.12.19 17:21, RW wrote:
>In rawbody rules the text is broken into chunks of 1024 to 2048 bytes,
>so the worst case isn't all that much worse than with {0,1000}.
>
>Also /m means that .* won't cross a line boundary in the decoded text
>and ^ can match in the middle of the chunk. This makes the average
>processing time less sensitive to any upper limit on .*.

So it is not the quantifiers that cause SA to take too much memory?

Any idea how to debug that?
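To illustrate RW's point above, here is a short Python sketch (SA's rules are actually Perl regexes; the chunk data here is made up). Without /s, "." never crosses a newline, /m lets "^" match mid-chunk, and a bounded quantifier like {0,1000} caps how far the engine can walk from each start position:

```python
import re

# Hypothetical ~2 KB "chunk" like the ones SA's rawbody scanner produces.
chunk = ("x" * 1000) + "\n" + 'line2 <a href="spam">'

# Without /s (DOTALL), "." never crosses a newline; with /m (MULTILINE),
# "^" can also match at the start of a line in the middle of the chunk.
pat = re.compile(r'^[^>].*?<a href=', re.IGNORECASE | re.MULTILINE)
m = pat.search(chunk)
print(m.group(0))  # 'line2 <a href=' -- matched entirely within one line

# A bounded quantifier caps how far the engine can walk from each start:
bounded = re.compile(r'^[^>].{0,1000}?<a href=', re.IGNORECASE | re.MULTILINE)
print(bounded.search(chunk) is not None)  # True
```

Since the lazy .*? is already confined to one line inside one chunk, the bounded form behaves almost identically here.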

-- 
Matus UHLAR - fantomas, uhlar@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
The 3 biggest disasters: Hiroshima 45, Tschernobyl 86, Windows 95

Re: SA memory (Re: ".*" in body rules)

Posted by Nix <ni...@esperi.org.uk>.
On 6 Dec 2019, Henrik K. spake thusly:

> On Fri, Dec 06, 2019 at 10:23:15AM +0100, Matus UHLAR - fantomas wrote:
>> >On Thu, 5 Dec 2019 17:07:05 +0100
>> >Matus UHLAR - fantomas wrote:
>> >>seems some big mails were too long to scan, and SA even got killed.
>> >>
>> >>[2146809.213586] Out of memory: Kill process 3660 (spamassassin)
>> >>score 365 or sacrifice child [2146809.213613] Killed process 3660
>> >>(spamassassin) total-vm:2960664kB, anon-rss:2921892kB, file-rss:0kB,
>> >>shmem-rss:0kB [2146809.270342] oom_reaper: reaped process 3660
>> >>(spamassassin), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
>> >>
>> >>I see the mail body contains nearly 20MB uuencoded text (don't ask).
>> >>
>> >>I found some body rules that contain ".*" instead of a sane
>> >>quantifier:
>> >>
>> >>72_active.cf:rawbody            __HAS_HREF      /^[^>].*?<a href=/im
>> >>72_active.cf:rawbody            __HAS_HREF_ONECASE      /^[^>].*?<(a
>> >>href|A HREF)=/m 72_active.cf:rawbody            __HAS_IMG_SRC
>> >>/^[^>].*?<img src=/im 72_active.cf:rawbody  __HAS_IMG_SRC_DATA
>> >>/^[^>].*?<img src=['"]data/im 72_active.cf:rawbody
>> >>__HAS_IMG_SRC_ONECASE   /^[^>].*?<(img src|IMG SRC)=/m
>> >>
>> >>There are different checks that have the "*" quantifier tho.
>> >>Is it reasonable to replace them with {0,1000} globally?
>> 
>> On 05.12.19 17:21, RW wrote:
>> >In rawbody rules the text is broken into chunks of 1024 to 2048 bytes,
>> >so the worst case isn't all that much worse than with {0,1000}.
>> >
>> >Also /m means that .* won't cross a line boundary in the decoded text
>> >and ^ can match in the middle of the chunk. This makes the average
>> >processing time less sensitive to any upper limit on .*.
>> 
>> so it is not the quantifiers that cause SA to take too much memory?
>> 
>> any idea how to debug that?
>
> Scanning a generic 20MB mail will normally eat ~700MB of memory.  3GB implies
> something is buggy.  Feel free to send a sample if you can.

Yeah. Similarly, a standard sa-learn run blew up to 96GiB RSS and got
oom-killed here, just last night. That this happened to two people
simultaneously suggests that something bad crept in in the 4th Dec rule
update... I'll see if I can replicate it with -D on. (If sa-learn even
pays attention to -D :) )

-- 
NULL && (void)

Re: SA memory (Re: ".*" in body rules)

Posted by Henrik K <he...@hege.li>.
On Wed, Dec 11, 2019 at 01:58:03PM +0100, Matus UHLAR - fantomas wrote:
>
> My question was whether there's a bug in the bayes code causing it to eat too
> much memory.  Both ~750B per token with file-based bayes and ~600B per
> token with redis-based bayes look like too much to me.

Not so much a bug, but we should probably add some internal limit to parsed
tokens (10000?) - a normal message would not contain more tokens.  At those
counts the per-token memory usage is irrelevant (but we could look at
optimizing it too).  We just need to be careful not to create a loophole for
spammers (filling up a few 50k parts with random short tokens, so the last
part won't be tokenized at all?).

Created a bug so it won't be forgotten:
https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7776
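Henrik's suggested cap could be sketched like this (a hypothetical illustration, not SpamAssassin code; the 10000 figure is his rough proposal, not shipped behaviour):

```python
# Hedged sketch: a hypothetical internal cap on parsed tokens.
def tokenize_capped(words, max_tokens=10000):
    tokens = set()
    for w in words:
        if len(tokens) >= max_tokens:
            # Stop early: bounds memory on pathological multi-MB text parts.
            # The loophole mentioned above: junk early in the body can exhaust
            # the cap, so later (meaningful) text never gets tokenized.
            break
        tokens.add(w)
    return tokens

toks = tokenize_capped(f"tok{i}" for i in range(6_158_242))
print(len(toks))  # 10000 instead of ~6 million
```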


Re: SA memory (Re: ".*" in body rules)

Posted by Matus UHLAR - fantomas <uh...@fantomas.sk>.
>> >On Wed, Dec 11, 2019 at 10:53:04AM +0100, Matus UHLAR - fantomas wrote:
>> >>On 11.12.19 11:43, Henrik K wrote:
>> >>>Wow 6 million tokens.. :-)
>> >>>
>> >>>I assume the big uuencoded blob content-type is text/* since it's tokenized?
>>
>> >>yes, I mentioned that in previous mails. ~15M file, uuencoded in ~20M mail.
>> >>
>> >>grep -c '^M' spamassassin-memory-error-<...>
>> >>329312
>> >>
>> >>One of former mails mentioned that 20M mail should use ~700M of RAM. 6M
>> >>tokens eating about 4G of RAM means ~750B per token, is that fine?
>>
>> On 11.12.19 12:07, Henrik K wrote:
>> >I'm pretty sure the Bayes code does many dumb things with the tokens
>> >that result in much memory usage for abnormal cases like this.

>On Wed, Dec 11, 2019 at 01:12:46PM +0100, Matus UHLAR - fantomas wrote:
>> but apparently nobody notices...

On 11.12.19 14:22, Henrik K wrote:
>How many people even scan 20MB mails?  Pretty much nobody.  It's not safe to
>do before SA 3.4.3, as you can see.  Before that, I know at least
>Amavisd-new could be configured to truncate large messages before feeding
>them to SA, which was somewhat safe to do.

I raised the limits years ago to see how it would go.  Since then, I have
received multiple many-MB spams; most of them hit BAYES_99, and without it
they would have become FNs.

This is about the second time it has caused problems - the first time it
happened on a very slow machine, and scanning took too much time.

My question was whether there's a bug in the bayes code causing it to eat too
much memory.  Both ~750B per token with file-based bayes and ~600B per
token with redis-based bayes look like too much to me.


-- 
Matus UHLAR - fantomas, uhlar@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
   One OS to rule them all, One OS to find them,
One OS to bring them all and into darkness bind them

Re: SA memory (Re: ".*" in body rules)

Posted by Henrik K <he...@hege.li>.
On Wed, Dec 11, 2019 at 01:12:46PM +0100, Matus UHLAR - fantomas wrote:
> >On Wed, Dec 11, 2019 at 10:53:04AM +0100, Matus UHLAR - fantomas wrote:
> >>On 11.12.19 11:43, Henrik K wrote:
> >>>Wow 6 million tokens.. :-)
> >>>
> >>>I assume the big uuencoded blob content-type is text/* since it's tokenized?
> 
> >>yes, I mentioned that in previous mails. ~15M file, uuencoded in ~20M mail.
> >>
> >>grep -c '^M' spamassassin-memory-error-<...>
> >>329312
> >>
> >>One of former mails mentioned that 20M mail should use ~700M of RAM. 6M
> >>tokens eating about 4G of RAM means ~750B per token, is that fine?
> 
> On 11.12.19 12:07, Henrik K wrote:
> >I'm pretty sure the Bayes code does many dumb things with the tokens
> >that result in much memory usage for abnormal cases like this.
> 
> but apparently nobody notices...

How many people even scan 20MB mails?  Pretty much nobody.  It's not safe to
do before SA 3.4.3, as you can see.  Before that, I know at least
Amavisd-new could be configured to truncate large messages before feeding
them to SA, which was somewhat safe to do.

> >>>This will be mitigated in 3.4.3, since it will only use max 50k of the body
> >>>text (body_part_scan_size).
> 
> >>will it prefer text parts and try to avoid uuencoded or base64 parts?
> >>(or maybe decode them?)
> 
> >There is no change in how parts are processed.  As before, "body" is the
> >concatenated result of all textual parts.  But in 3.4.3 at least each part
> >is truncated to 50k.  If there are several parts then it's 50k+50k, etc.
> 
> I understand that such a change apparently should not be done in a minor version.

It was decided to implement it in 3.4.3 to fix cases just like this, along
with the major CVE fixes.  Most likely people will use 3.4.3 until eternity.
I don't know when 4.0 will be released, and it will surely be adopted very
late by distributions.


Re: SA memory (Re: ".*" in body rules)

Posted by Matus UHLAR - fantomas <uh...@fantomas.sk>.
>On Wed, Dec 11, 2019 at 10:53:04AM +0100, Matus UHLAR - fantomas wrote:
>> On 11.12.19 11:43, Henrik K wrote:
>> >Wow 6 million tokens.. :-)
>> >
>> >I assume the big uuencoded blob content-type is text/* since it's tokenized?

>> yes, I mentioned that in previous mails. ~15M file, uuencoded in ~20M mail.
>>
>> grep -c '^M' spamassassin-memory-error-<...>
>> 329312
>>
>> One of former mails mentioned that 20M mail should use ~700M of RAM. 6M
>> tokens eating about 4G of RAM means ~750B per token, is that fine?

On 11.12.19 12:07, Henrik K wrote:
>I'm pretty sure the Bayes code does many dumb things with the tokens
>that result in much memory usage for abnormal cases like this.

but apparently nobody notices...

>> >This will be mitigated in 3.4.3, since it will only use max 50k of the body
>> >text (body_part_scan_size).

>> will it prefer text parts and try to avoid uuencoded or base64 parts?
>> (or maybe decode them?)

>There is no change in how parts are processed.  As before, "body" is the
>concatenated result of all textual parts.  But in 3.4.3 at least each part
>is truncated to 50k.  If there are several parts then it's 50k+50k, etc.

I understand that such a change apparently should not be done in a minor version.

Well, I tried on a currently unused machine with 16G of RAM and moved the
bayes DB there (scanning on an account without bayes was fast even on the
original machine, with the lower, previously mentioned ~700M memory usage).

Scanning took 17 minutes, topping out at 4.8G of memory.

When I tried the check with redis (copied the bayes DB there), scanning
topped out at 3.8G but took 29 minutes (???), even on repeated tests.

I understand I'm probably pushing too far, but you never know in advance.

I also understand redis is great for parallel scanning.



I include logs from scanning with filesystem bayes, including the places
where the biggest time differences are:


Dec 11 10:45:42.261 [12972] dbg: logger: adding facilities: all
...
Dec 11 10:45:43.969 [12972] dbg: message: ---- MIME PARSER END ----
Dec 11 10:45:44.038 [12972] dbg: message: no encoding detected
Dec 11 10:46:10.379 [12972] dbg: plugin: Mail::SpamAssassin::Plugin::URIDNSBL=HASH(0x5617d8cf6c48) implements 'parsed_metadata', priority 0
Dec 11 10:46:23.131 [12972] dbg: uridnsbl: more than 20 URIs, picking a subset
...
Dec 11 10:46:23.272 [12972] dbg: async: starting: DNSBL-A, dns:A:70.175.80.195.iadb.isipp.com (timeout 15.0s, min 3.0s)
Dec 11 10:48:22.828 [12972] dbg: check: check_main, time limit in 1639.598 s
...
Dec 11 10:48:23.005 [12972] dbg: bayes: corpus size: nspam = 89264, nham = 17109
Dec 11 10:49:30.445 [12972] dbg: bayes: tokenized body: 6158242 tokens
Dec 11 10:49:35.335 [12972] dbg: bayes: tokenized uri: 10881 tokens
Dec 11 10:49:35.351 [12972] dbg: bayes: tokenized invisible: 0 tokens
Dec 11 10:49:35.354 [12972] dbg: bayes: tokenized header: 208 tokens
Dec 11 10:50:54.200 [12972] dbg: bayes: score = 0.5
...
Dec 11 10:50:54.202 [12972] dbg: check: tagrun - tag TOKENSUMMARY is now ready, value: CODE(0x5617de4969e8)
Dec 11 10:50:58.537 [12972] dbg: async: select found no responses ready (t.o.=0.0)
Dec 11 10:50:58.537 [12972] dbg: async: queries completed: 0, started: 0
Dec 11 10:50:58.537 [12972] dbg: async: queries active: DNSBL-A=4 DNSBL-TXT=2 URI-A=9 URI-DNSBL=20 URI-NS=10, all expired at Wed Dec 11 10:50:58 2019
Dec 11 10:51:01.653 [12972] dbg: rules: running rawbody tests; score so far=-0.699
...
Dec 11 10:51:02.711 [12972] dbg: rules: compiled body tests
Dec 11 10:51:08.066 [12972] dbg: rules: ran body rule __hk_bigmoney ======> got hit: "$NK7M"
Dec 11 10:52:00.372 [12972] dbg: rules: ran body rule __DRUGS_MUSCLE1 ======> got hit: "@S"'<0 MA[+*"
Dec 11 10:52:01.853 [12972] dbg: rules: ran body rule __LOTSA_MONEY_03 ======> got hit: "$3M"
Dec 11 10:52:01.886 [12972] dbg: rules: ran body rule __DOS_BODY_WED ======> got hit: "WED"
Dec 11 10:52:05.859 [12972] dbg: rules: ran body rule __LOTSA_MONEY_01 ======> got hit: "$94O0541"
Dec 11 10:52:31.895 [12972] dbg: rules: ran body rule __HAS_ANY_EMAIL ======> got hit: "a@nspnz.s"
Dec 11 10:52:53.298 [12972] dbg: rules: ran body rule __DOS_BODY_SUN ======> got hit: "SUN"
Dec 11 10:52:53.298 [12972] dbg: rules: ran body rule __DOS_BODY_TUE ======> got hit: "Tuesday"
Dec 11 10:53:01.629 [12972] dbg: rules: ran body rule __FIFTY_FIFTY ======> got hit: "50%"
Dec 11 10:53:04.870 [12972] dbg: rules: ran body rule __DOS_BODY_SAT ======> got hit: "SAT"
Dec 11 10:53:06.939 [12972] dbg: rules: ran body rule __DOS_BODY_FRI ======> got hit: "FRI"
Dec 11 10:53:06.951 [12972] dbg: rules: ran body rule __freemail_safe_fwd ======> got hit: "---Original Message"
Dec 11 10:56:14.590 [12972] dbg: rules: ran body rule __FRAUD_DBI ======> got hit: "$,, M"
Dec 11 10:56:58.462 [12972] dbg: rules: ran body rule __FB_COST ======> got hit: "COST"
Dec 11 10:57:02.611 [12972] dbg: rules: ran body rule FUZZY_PRICES ======> got hit: "PR!@*3Z"
Dec 11 10:57:07.993 [12972] dbg: rules: ran body rule WEIRD_QUOTING ======> got hit: """,`_'2""""
Dec 11 10:57:11.069 [12972] dbg: rules: ran body rule FUZZY_CPILL ======> got hit: "KYO11Z"
Dec 11 10:57:21.916 [12972] dbg: rules: ran body rule __LOTSA_MONEY_02 ======> got hit: "2,3O964$"
Dec 11 10:57:47.954 [12972] dbg: rules: ran body rule __DOS_BODY_THU ======> got hit: "THU"
Dec 11 10:58:03.551 [12972] dbg: rules: ran body rule __LOTSA_MONEY_04 ======> got hit: "1MN98USD"
Dec 11 10:58:10.930 [12972] dbg: rules: ran body rule __NONEMPTY_BODY ======> got hit: "R"
Dec 11 10:58:29.975 [12972] dbg: rules: ran body rule FUZZY_CREDIT ======> got hit: "CREDYT"
Dec 11 10:58:42.635 [12972] dbg: rules: ran body rule __FUZZY_DR_OZ ======> got hit: "DGC0S "
Dec 11 10:58:54.772 [12972] dbg: rules: ran body rule __DOS_BODY_TICKER ======> got hit: "MVYR.PK"
Dec 11 10:59:20.483 [12972] dbg: rules: ran body rule __FB_NUM_PERCNT ======> got hit: "0%"
Dec 11 10:59:20.490 [12972] dbg: rules: ran body rule __DOS_BODY_MON ======> got hit: "MON"
Dec 11 10:59:25.221 [12972] dbg: rules: ran body rule __BODY_TEXT_LINE ======> got hit: "R"
Dec 11 10:59:25.221 [12972] dbg: rules: ran body rule __BODY_TEXT_LINE ======> got hit: "M"
Dec 11 10:59:25.221 [12972] dbg: rules: ran body rule __BODY_TEXT_LINE ======> got hit: "<CF>"
Dec 11 11:00:24.968 [12972] dbg: rules: ran body rule FUZZY_XPILL ======> got hit: "X;#0NA%X"
Dec 11 11:02:29.635 [12972] dbg: dns: bgread: received 113 bytes from 10.51.1.14
...
Dec 11 11:02:31.471 [12972] dbg: rules: compiled rawbody tests
Dec 11 11:02:35.862 [12972] dbg: rules: ran rawbody rule __HTML_SINGLET ======> got hit: ">W<"
...
Dec 11 11:02:36.349 [12972] dbg: rules: [...] M5ULS"=>P"
Dec 11 11:02:37.267 [12972] dbg: async: select found no responses ready (t.o.=0.0)
...
Dec 11 11:02:37.281 [12972] dbg: check: ascii_text_illegal: matches >> Odoslan<e9> z iPhonu
Dec 11 11:02:38.039 [12972] dbg: async: select found no responses ready (t.o.=0.0)
...
Dec 11 11:02:38.064 [12972] dbg: dns: entering helper-app run mode
Dec 11 11:02:43.064 [12972] dbg: dns: leaving helper-app run mode
...
Dec 11 11:02:43.735 [12972] dbg: netset: cache trusted_networks hits/attempts: 8/10, 80.0 %
-- 
Matus UHLAR - fantomas, uhlar@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
I feel like I'm diagonally parked in a parallel universe.

Re: SA memory (Re: ".*" in body rules)

Posted by Henrik K <he...@hege.li>.
On Wed, Dec 11, 2019 at 10:53:04AM +0100, Matus UHLAR - fantomas wrote:
> On 11.12.19 11:43, Henrik K wrote:
> >Wow 6 million tokens.. :-)
> >
> >I assume the big uuencoded blob content-type is text/* since it's tokenized?
> 
> yes, I mentioned that in previous mails. ~15M file, uuencoded in ~20M mail.
> 
> grep -c '^M' spamassassin-memory-error-<...>
> 329312
> 
> One of former mails mentioned that 20M mail should use ~700M of RAM. 6M
> tokens eating about 4G of RAM means ~750B per token, is that fine?

I'm pretty sure the Bayes code does many dumb things with the tokens
that result in excessive memory usage for abnormal cases like this.

> >This will be mitigated in 3.4.3, since it will only use max 50k of the body
> >text (body_part_scan_size).
> 
> will it prefer text parts and try to avoid uuencoded or base64 parts?
> (or maybe decode them?)

There is no change in how parts are processed.  As before, "body" is the
concatenated result of all textual parts.  But in 3.4.3 at least each part
is truncated to 50k.  If there are several parts then it's 50k+50k, etc.
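The per-part truncation described above can be sketched as follows (a simplified illustration, not SpamAssassin's actual code; the constant name and 50k value come from this thread's mention of body_part_scan_size):

```python
# Sketch of 3.4.3's per-part truncation, assuming parts are already
# MIME-decoded text.
BODY_PART_SCAN_SIZE = 50 * 1024

def rendered_body(text_parts):
    # Each textual part contributes at most 50k, so N parts can still
    # contribute up to N * 50k in total.
    return "".join(part[:BODY_PART_SCAN_SIZE] for part in text_parts)

body = rendered_body(["a" * 20_000_000, "b" * 1_000])
print(len(body))  # 51200 + 1000 = 52200, instead of ~20 MB
```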


Re: SA memory (Re: ".*" in body rules)

Posted by RW <rw...@googlemail.com>.
On Wed, 11 Dec 2019 08:59:53 -0500
Bill Cole wrote:


> It's infeasible and would be unwise to put a uucode detector/decoder
> into SA. There's no limit to the corner cases that can arise with a
> sloppy format that has an unknowable number of mostly unmaintained
> implementations which use lore and local habits in place of
> specification.

I think that's overstating it.  It could probably be implemented so it
works or falls back to the status quo. But it's the status quo that's
most wrong.

I think it could have been a lot worse if the backend had been faster
and/or the attachment smaller, so that it completed.  In that case
training could have swamped the Bayes database, expiring the useful
tokens. If Redis does its internal LRU expiry promptly, it's the most
vulnerable to this.


With manual training a separate limit can be used in sa-learn or spamc.
With autotraining I think anything that's small enough to scan can be
trained.

Re: SA memory (Re: ".*" in body rules)

Posted by Bill Cole <sa...@billmail.scconsult.com>.
On 11 Dec 2019, at 7:58, Henrik K wrote:

> SA does not and should not do any kind of content decoding/mangling 
> for
> text/plain contents.

Minor point:

SA does (as it should) decode Base64 or Quoted-Printable text/* MIME 
parts to a canonical binary form of whatever character set is being 
used, e.g. UTF-8, Latin-1, etc.

It's infeasible and would be unwise to put a uucode detector/decoder 
into SA. There's no limit to the corner cases that can arise with a 
sloppy format that has an unknowable number of mostly unmaintained 
implementations which use lore and local habits in place of 
specification.

-- 
Bill Cole
bill@scconsult.com or billcole@apache.org
(AKA @grumpybozo and many *@billmail.scconsult.com addresses)
Not For Hire (currently)

Re: SA memory (Re: ".*" in body rules)

Posted by Henrik K <he...@hege.li>.
On Wed, Dec 11, 2019 at 01:44:35PM +0100, Matus UHLAR - fantomas wrote:
> 
> old school "attachment": no Content-Type, plaintext, uuencode inline.
> I don't think SA decodes that.
> I don't know if it should, but at least the detection should be OK.

When Content-Type is missing, part is assumed to be text/plain.

SA does not and should not do any kind of content decoding/mangling for
text/plain contents, even if the content looks like something decodable.

As a small exception to this, SA previously tried to decode HTML if
heuristics found HTML at the beginning of a text/plain part.  But this was
scrapped in 3.4.3, since no MUA actually does that these days; it was
decade-old code.

But everything depends on what popular MUAs do.  If they suddenly decide to
start decoding some random text contents, then SA should be modified
accordingly.


Re: SA memory (Re: ".*" in body rules)

Posted by Matus UHLAR - fantomas <uh...@fantomas.sk>.
>On Wed, Dec 11, 2019 at 10:53:04AM +0100, Matus UHLAR - fantomas wrote:
>> will it prefer text parts and try to avoid uuencoded or base64 parts?
>> (or maybe decode them?)

On 11.12.19 14:35, Henrik K wrote:
>To clarify, of course SA decodes base64 parts.  Base64 is a standard MIME
>transfer encoding.  It's decoded to reveal the actual part content.  But
>this is independent of what the actual Content-Type is declared as.  SA
>basically looks for Content-Type: text/* parts and decodes stuff when
>required.
>
>Now whether your message was actually MIME transfer encoded as uuencode, I
>don't know?  Most likely it was just a text part with actual uuencoded text
>contents?  Or did your message have Content-Transfer-Encoding: x-uuencode
>header?  I don't even know if SA knows how to decode that..

old school "attachment": no Content-Type, plaintext, uuencode inline.
I don't think SA decodes that.
I don't know if it should, but at least the detection should be OK.
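Detection of such inline uuencode is cheap; the sketch below is a hypothetical illustration (not something SA does, per this thread). Classic uuencode starts with a "begin <mode> <name>" line and its full-length data lines start with "M" (the length byte for 45 bytes), which is also why counting lines starting with "M" works as a rough measure:

```python
import binascii
import re

def looks_uuencoded(text):
    # Hypothetical detector: look for a classic "begin 644 filename" line.
    return re.search(r'(?m)^begin [0-7]{3,4} \S', text) is not None

# Made-up sample: one full 45-byte data line between begin/end markers.
sample = (
    "begin 644 redacted.rar\n"
    + binascii.b2a_uu(b"\x00" * 45).decode("ascii")
    + "`\nend\n"
)
print(sample[23])               # 'M' -- first char of a full data line
print(looks_uuencoded(sample))  # True
```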

-- 
Matus UHLAR - fantomas, uhlar@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
I just got lost in thought. It was unfamiliar territory.

Re: SA memory (Re: ".*" in body rules)

Posted by Henrik K <he...@hege.li>.
On Wed, Dec 11, 2019 at 10:53:04AM +0100, Matus UHLAR - fantomas wrote:
> 
> will it prefer text parts and try to avoid uuencoded or base64 parts?
> (or maybe decode them?)

To clarify, of course SA decodes base64 parts.  Base64 is a standard MIME
transfer encoding.  It's decoded to reveal the actual part content.  But
this is independent of what the actual Content-Type is declared as.  SA
basically looks for Content-Type: text/* parts and decodes stuff when
required.

Now whether your message was actually MIME transfer encoded as uuencode, I
don't know?  Most likely it was just a text part with actual uuencoded text
contents?  Or did your message have Content-Transfer-Encoding: x-uuencode
header?  I don't even know if SA knows how to decode that..
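The transfer-encoding point can be illustrated in a few lines of Python, with the email module standing in for SA's MIME parsing (the message text is made up):

```python
import base64
from email import message_from_string

# The base64 transfer encoding is decoded to reveal the part content,
# independently of the declared Content-Type.
raw = (
    "Content-Type: text/plain; charset=us-ascii\n"
    "Content-Transfer-Encoding: base64\n"
    "\n"
    + base64.b64encode(b"decoded text part\n").decode("ascii")
    + "\n"
)
msg = message_from_string(raw)
print(msg.get_payload(decode=True))  # b'decoded text part\n'
```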


Re: SA memory (Re: ".*" in body rules)

Posted by Matus UHLAR - fantomas <uh...@fantomas.sk>.
>On Wed, Dec 11, 2019 at 10:04:56AM +0100, Matus UHLAR - fantomas wrote:
>> >>hmmm, the machine has 4G of RAM and SA now takes 4.5.
>> >>The check rund out of time but produces ~450K debug file.
>> >>
>> >>This is where it hangs:
>> >>
>> >>Dec 10 17:43:51.727 [9721] dbg: bayes: tokenized header: 211 tokens
>>
>> On 10.12.19 22:52, RW wrote:
>> >What are the full counts if you put it through 'grep tokenized'
>>
>> Dec 10 17:43:49.137 [9721] dbg: bayes: tokenized body: 6158242 tokens
>> Dec 10 17:43:51.713 [9721] dbg: bayes: tokenized uri: 10881 tokens

On 11.12.19 11:43, Henrik K wrote:
>Wow 6 million tokens.. :-)
>
>I assume the big uuencoded blob content-type is text/* since it's tokenized?

yes, I mentioned that in previous mails. ~15M file, uuencoded in ~20M mail.

grep -c '^M' spamassassin-memory-error-<...>
329312

One of the former mails mentioned that a 20M mail should use ~700M of RAM.
6M tokens eating about 4G of RAM means ~750B per token; is that fine?
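A quick back-of-envelope check of those figures (exact RSS varies; 4 GB is the rough number from this thread):

```python
# ~6.16M body tokens against ~4 GB of RSS works out to roughly 700 B/token,
# in the same ballpark as the ~750B estimate.
tokens = 6_158_242
rss = 4 * 1024**3           # ~4 GB observed while scanning
print(round(rss / tokens))  # 697 bytes per token
```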

>This will be mitigated in 3.4.3, since it will only use max 50k of the body
>text (body_part_scan_size).

will it prefer text parts and try to avoid uuencoded or base64 parts?
(or maybe decode them?)

-- 
Matus UHLAR - fantomas, uhlar@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
Enter any 12-digit prime number to continue.

Re: SA memory (Re: ".*" in body rules)

Posted by Henrik K <he...@hege.li>.
On Wed, Dec 11, 2019 at 10:04:56AM +0100, Matus UHLAR - fantomas wrote:
> >>hmmm, the machine has 4G of RAM and SA now takes 4.5G.
> >>The check runs out of time but produces a ~450K debug file.
> >>
> >>This is where it hangs:
> >>
> >>Dec 10 17:43:51.727 [9721] dbg: bayes: tokenized header: 211 tokens
> 
> On 10.12.19 22:52, RW wrote:
> >What are the full counts if you put it through 'grep tokenized'
> 
> Dec 10 17:43:49.137 [9721] dbg: bayes: tokenized body: 6158242 tokens
> Dec 10 17:43:51.713 [9721] dbg: bayes: tokenized uri: 10881 tokens

Wow 6 million tokens.. :-)

I assume the big uuencoded blob content-type is text/* since it's tokenized?

This will be mitigated in 3.4.3, since it will only use max 50k of the body
text (body_part_scan_size).


Re: SA memory (Re: ".*" in body rules)

Posted by Matus UHLAR - fantomas <uh...@fantomas.sk>.
>> hmmm, the machine has 4G of RAM and SA now takes 4.5G.
>> The check runs out of time but produces a ~450K debug file.
>>
>> This is where it hangs:
>>
>> Dec 10 17:43:51.727 [9721] dbg: bayes: tokenized header: 211 tokens

On 10.12.19 22:52, RW wrote:
>What are the full counts if you put it through 'grep tokenized'

Dec 10 17:43:49.137 [9721] dbg: bayes: tokenized body: 6158242 tokens
Dec 10 17:43:51.713 [9721] dbg: bayes: tokenized uri: 10881 tokens
Dec 10 17:43:51.724 [9721] dbg: bayes: tokenized invisible: 0 tokens
Dec 10 17:43:51.727 [9721] dbg: bayes: tokenized header: 211 tokens

-- 
Matus UHLAR - fantomas, uhlar@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
"To Boot or not to Boot, that's the question." [WD1270 Caviar]

Re: SA memory (Re: ".*" in body rules)

Posted by RW <rw...@googlemail.com>.
> hmmm, the machine has 4G of RAM and SA now takes 4.5G.
> The check runs out of time but produces a ~450K debug file.
> 
> This is where it hangs:
> 
> Dec 10 17:43:51.727 [9721] dbg: bayes: tokenized header: 211 tokens

What are the full counts if you put it through 'grep tokenized'?



Re: SA memory (Re: ".*" in body rules)

Posted by Matus UHLAR - fantomas <uh...@fantomas.sk>.
>On Mon, Dec 09, 2019 at 10:54:00AM +0100, Matus UHLAR - fantomas wrote:
>> I'm afraid I can't provide clients' file.
>>
>> I can only repeat:
>> - the mail is 20424329 bytes
>> - the mail contains single uuencoded .rar file inline.
>>
>> -rw-rw-rw- 1 root root 14818832 Dec  9 10:50 'redacted.rar'
>>
>> I tried to run it again; it took about 20 minutes to scan, and memory
>> usage slowly increased up to:
>>
>>  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
>> 1924 amavis    20   0 3916332   2.8g   1468 D   1.0  72.2   3:08.08 spamassassin
>>
>> note the "amavis" is the spamassassin command line client running under
>> amavis user to use amavis' bayes database:
>>
>> amavis    1924 24.8 72.9 3916332 2923964 ?     D    10:23   3:08 /usr/bin/perl -T -w /usr/bin/spamassassin -x
>>
>> -rw------- 1 amavis amavis 10584064 Dec  9 10:45 bayes_seen
>> -rw------- 1 amavis amavis 10760192 Dec  9 10:45 bayes_toks
>>
>> I tried to attach to the process using strace; after a while it produced
>> output (only 2 rules hit) and exited.  I hope this didn't cause a premature
>> exit of the SA client.

On 09.12.19 12:07, Henrik K wrote:
>And what does running spamassassin debug directly from command line output?
>Where does it hang?
>
>spamassassin -t -D < message >/dev/null

hmmm, the machine has 4G of RAM and SA now takes 4.5G.
The check runs out of time but produces a ~450K debug file.

This is where it hangs:

Dec 10 17:43:51.727 [9721] dbg: bayes: tokenized header: 211 tokens
Dec 10 17:50:16.111 [9721] info: check: exceeded time limit in Mail::SpamAssassin::Plugin::Check::_eval_tests_type11_prineg90_set3, skipping further tests

I guess it's just the slowness of the bayes check (haven't tried redis),

but it still doesn't explain why it takes that much RAM, does it?

I can try on a machine with more RAM; hopefully that will help.

-- 
Matus UHLAR - fantomas, uhlar@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
Silvester Stallone: Father of the RISC concept.

Re: SA memory (Re: ".*" in body rules)

Posted by Henrik K <he...@hege.li>.
On Mon, Dec 09, 2019 at 10:54:00AM +0100, Matus UHLAR - fantomas wrote:
> 
> I'm afraid I can't provide clients' file.
> 
> I can only repeat:
> - the mail is 20424329 bytes
> - the mail contains single uuencoded .rar file inline.
> 
> -rw-rw-rw- 1 root root 14818832 Dec  9 10:50 'redacted.rar'
> 
> I tried to run it again; it took about 20 minutes to scan, and memory
> usage slowly increased up to:
> 
>  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
> 1924 amavis    20   0 3916332   2.8g   1468 D   1.0  72.2   3:08.08 spamassassin
> 
> note that "amavis" is the spamassassin command-line client running under
> the amavis user, to use amavis' bayes database:
> 
> amavis    1924 24.8 72.9 3916332 2923964 ?     D    10:23   3:08 /usr/bin/perl -T -w /usr/bin/spamassassin -x
> 
> -rw------- 1 amavis amavis 10584064 Dec  9 10:45 bayes_seen
> -rw------- 1 amavis amavis 10760192 Dec  9 10:45 bayes_toks
> 
> I tried to attach to the process using strace; after a while it produced
> output (only 2 rules hit) and exited.  I hope this didn't cause a premature
> exit of the SA client.

And what does running spamassassin debug directly from command line output? 
Where does it hang?

spamassassin -t -D < message >/dev/null


Re: SA memory (Re: ".*" in body rules)

Posted by Matus UHLAR - fantomas <uh...@fantomas.sk>.
>> >On Thu, 5 Dec 2019 17:07:05 +0100
>> >Matus UHLAR - fantomas wrote:
>> >>seems some big mails were too long to scan, and SA even got killed.
>> >>
>> >>[2146809.213586] Out of memory: Kill process 3660 (spamassassin)
>> >>score 365 or sacrifice child [2146809.213613] Killed process 3660
>> >>(spamassassin) total-vm:2960664kB, anon-rss:2921892kB, file-rss:0kB,
>> >>shmem-rss:0kB [2146809.270342] oom_reaper: reaped process 3660
>> >>(spamassassin), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
>> >>
>> >>I see the mail body contains nearly 20MB uuencoded text (don't ask).

>> On 05.12.19 17:21, RW wrote:
>> >In rawbody rules the text is broken into chunks of 1024 to 2048 bytes,
>> >so the worst case isn't all that much worse than with {0,1000}.
>> >
>> >Also /m means that .* won't cross a line boundary in the decoded text
>> >and ^ can match in the middle of the chunk. This makes the average
>> >processing time less sensitive to any upper limit on .*.

>On Fri, Dec 06, 2019 at 10:23:15AM +0100, Matus UHLAR - fantomas wrote:
>> so it is not the quantifiers that cause SA to take too much memory?
>>
>> any idea how to debug that?

On 06.12.19 13:16, Henrik K wrote:
>Scanning a generic 20MB mail will normally eat ~700MB of memory.  3GB implies
>something is buggy.  Feel free to send a sample if you can.

I'm afraid I can't provide the client's file.

I can only repeat:
- the mail is 20424329 bytes
- the mail contains a single uuencoded .rar file inline.

-rw-rw-rw- 1 root root 14818832 Dec  9 10:50 'redacted.rar'

I tried to run it again; it took about 20 minutes to scan, and memory
usage slowly increased up to:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 1924 amavis    20   0 3916332   2.8g   1468 D   1.0  72.2   3:08.08 spamassassin

note that "amavis" is the spamassassin command-line client running under
the amavis user, to use amavis' bayes database:

amavis    1924 24.8 72.9 3916332 2923964 ?     D    10:23   3:08 /usr/bin/perl -T -w /usr/bin/spamassassin -x

-rw------- 1 amavis amavis 10584064 Dec  9 10:45 bayes_seen
-rw------- 1 amavis amavis 10760192 Dec  9 10:45 bayes_toks

I tried to attach to the process using strace; after a while it produced
output (only 2 rules hit) and exited.  I hope this didn't cause a premature
exit of the SA client.

-- 
Matus UHLAR - fantomas, uhlar@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
How does cat play with mouse? cat /dev/mouse

Re: SA memory (Re: ".*" in body rules)

Posted by Henrik K <he...@hege.li>.
On Fri, Dec 06, 2019 at 10:23:15AM +0100, Matus UHLAR - fantomas wrote:
> >On Thu, 5 Dec 2019 17:07:05 +0100
> >Matus UHLAR - fantomas wrote:
> >>seems some big mails were too long to scan, and SA even got killed.
> >>
> >>[2146809.213586] Out of memory: Kill process 3660 (spamassassin)
> >>score 365 or sacrifice child [2146809.213613] Killed process 3660
> >>(spamassassin) total-vm:2960664kB, anon-rss:2921892kB, file-rss:0kB,
> >>shmem-rss:0kB [2146809.270342] oom_reaper: reaped process 3660
> >>(spamassassin), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
> >>
> >>I see the mail body contains nearly 20MB uuencoded text (don't ask).
> >>
> >>I found some body rules that contain ".*" instead of a sane
> >>quantifier:
> >>
> >>72_active.cf:rawbody            __HAS_HREF      /^[^>].*?<a href=/im
> >>72_active.cf:rawbody            __HAS_HREF_ONECASE      /^[^>].*?<(a
> >>href|A HREF)=/m 72_active.cf:rawbody            __HAS_IMG_SRC
> >>/^[^>].*?<img src=/im 72_active.cf:rawbody  __HAS_IMG_SRC_DATA
> >>/^[^>].*?<img src=['"]data/im 72_active.cf:rawbody
> >>__HAS_IMG_SRC_ONECASE   /^[^>].*?<(img src|IMG SRC)=/m
> >>
> >>There are different checks that have the "*" quantifier tho.
> >>Is it reasonable to replace them with {0,1000} globally?
> 
> On 05.12.19 17:21, RW wrote:
> >In rawbody rules the text is broken into chunks of 1024 to 2048 bytes,
> >so the worst case isn't all that much worse than with {0,1000}.
> >
> >Also /m means that .* won't cross a line boundary in the decoded text
> >and ^ can match in the middle of the chunk. This makes the average
> >processing time less sensitive to any upper limit on .*.
> 
> so it is not the quantifiers that cause SA to take too much memory?
> 
> any idea how to debug that?

Scanning a generic 20MB mail will normally eat ~700MB of memory.  3GB implies
something is buggy.  Feel free to send a sample if you can.