Posted to users@spamassassin.apache.org by Reindl Harald <h....@thelounge.net> on 2015/05/03 10:55:24 UTC

interesting spammer trick (bayes)

Hi

recently, while playing around with bayes training, i observed that some
junk (maybe unintentionally) uses the mimetype 'application/octet-stream'
instead of 'text/html' for an attachment containing the payload of a form
with javascript, which prevents the attachment from being tokenized
________________________________________

the new feature in 3.4.1 will take care of that, though i am not sure how
much impact a trained attachment ultimately has on classification

SHA1 digests of all MIME parts (including non-textual) can now be
contributed to Bayes tokens, which allows the bayes classifier to assess
also the non-textual content. The set of sources of bayes tokens is
configurable with a new configuration option 'bayes_token_sources'
as documented in the Mail::SpamAssassin::Conf man page. (Bug 7115)
It is disabled by default for backward compatibility.
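
To make the idea of per-part digests concrete, here is a minimal Python sketch. It is not SpamAssassin's code: the function name `mime_part_digests`, the sample message `RAW`, and the `digest:content-type` token layout are assumptions for illustration only. It decodes each leaf MIME part (undoing Base64 or quoted-printable), hashes the result with SHA1, and appends the Content-Type, so an 'application/octet-stream' part that is never tokenized as text still yields one stable token.

```python
import hashlib
from email import message_from_string

# Hypothetical sample: a text part plus an HTML form hidden in an
# application/octet-stream attachment (Base64 of "<form><script></script></form>").
RAW = """\
From: a@example.com
To: b@example.com
Subject: test
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="XYZ"

--XYZ
Content-Type: text/plain; charset=us-ascii

hello
--XYZ
Content-Type: application/octet-stream
Content-Transfer-Encoding: base64

PGZvcm0+PHNjcmlwdD48L3NjcmlwdD48L2Zvcm0+
--XYZ--
"""


def mime_part_digests(raw_message):
    """Return one 'sha1hex:content-type' string per leaf MIME part.

    The token layout is an assumption; the real token format used by
    SpamAssassin is not shown in this thread.
    """
    msg = message_from_string(raw_message)
    digests = []
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers carry no payload of their own
        # decode=True undoes Base64 / quoted-printable transfer encoding
        body = part.get_payload(decode=True) or b""
        sha1 = hashlib.sha1(body).hexdigest()
        digests.append(f"{sha1}:{part.get_content_type()}")
    return digests


if __name__ == "__main__":
    for token in mime_part_digests(RAW):
        print(token)
```

Because the digest covers the decoded bytes, the same attachment sent with a different transfer encoding would map to the same token in this sketch, while any change to the payload itself produces a different one.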
________________________________________

i am not sure what "backward compatibility" means in this context

correct me if i am wrong, but IMHO "bayes_token_sources all" should not
have a side effect when you train a bayes database on SA 3.4.1 and share
it with a setup using 3.4.0 - the 3.4.0 setup just would not benefit from
the new mimepart tokens in the database, but still from all others?
________________________________________

https://spamassassin.apache.org/full/3.4.x/doc/Mail_SpamAssassin_Conf.html

bayes_token_sources (default: header visible invisible uri)

Controls which sources in a mail message can contribute tokens (e.g. 
words, phrases, etc.) to a Bayes classifier. The argument is a 
space-separated list of keywords (header, visible, invisible, uri, 
mimepart), each of which may be prefixed by a no to indicate its 
exclusion. Additionally two reserved keywords are allowed: all and none 
(or: noall). The list of keywords is processed sequentially: a keyword 
all adds all available keywords to a set being built, a none or noall 
clears the set, other non-negated keywords are added to the set, and 
negated keywords are removed from the set. Keywords are case-insensitive.

The default set is: header visible invisible uri, which is equivalent 
for example to: All NoMIMEpart. The reason why mimepart is not currently 
in a default set is that it is a newer source (introduced with 
SpamAssassin version 3.4.1) and not much experience has yet been 
gathered regarding its usefulness.
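
The sequential keyword processing described above can be sketched in a few lines of Python. This is an illustration of the documented rules, not the actual Mail::SpamAssassin::Conf code; the names `ALL_SOURCES` and `resolve_token_sources` are hypothetical, and starting from an empty set is an assumption.

```python
# Keywords named in the documentation quoted above.
ALL_SOURCES = ("header", "visible", "invisible", "uri", "mimepart")


def resolve_token_sources(value):
    """Resolve a bayes_token_sources argument into a set of source names.

    Assumption: the set being built starts out empty.
    """
    selected = set()
    for word in value.lower().split():  # keywords are case-insensitive
        if word == "all":
            selected.update(ALL_SOURCES)
        elif word in ("none", "noall"):
            selected.clear()
        elif word in ALL_SOURCES:
            selected.add(word)
        elif word.startswith("no") and word[2:] in ALL_SOURCES:
            selected.discard(word[2:])  # negated keyword removes the source
        else:
            raise ValueError(f"unknown keyword: {word}")
    return selected


if __name__ == "__main__":
    print(sorted(resolve_token_sources("All NoMIMEpart")))
    print(sorted(resolve_token_sources("header visible invisible uri")))
```

Under these assumptions both calls yield the same four sources, matching the equivalence stated in the documentation, and order matters: a later none or noall discards everything listed before it.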

See also option bayes_ignore_header for a fine-grained control on 
individual header fields under the umbrella of a more general keyword 
header here.

     Keywords imply the following data sources:

     header - tokens collected from a message header section
     visible - words from visible text (plain or HTML) in a message body
     invisible - hidden/invisible text in HTML parts of a message body
     uri - URIs collected from a message body
     mimepart - digests (hashes) of all MIME parts (textual or
       non-textual) of a message, computed after Base64 and
       quoted-printable decoding, suffixed by their Content-Type
     all - adds all the above keywords to the set being assembled
     none or noall - removes all keywords from the set



Re: interesting spammer trick (bayes)

Posted by Mark Martinec <Ma...@ijs.si>.
On 2015-05-03 10:55, Reindl Harald wrote:
> recently, while playing around with bayes training, i observed that some
> junk (maybe unintentionally) uses the mimetype 'application/octet-stream'
> instead of 'text/html' for an attachment containing the payload of a form
> with javascript, which prevents the attachment from being tokenized
> ________________________________________
>
> the new feature in 3.4.1 will take care of that, though i am not sure how
> much impact a trained attachment ultimately has on classification
>
> SHA1 digests of all MIME parts (including non-textual) can now be
> contributed to Bayes tokens, which allows the bayes classifier to assess
> also the non-textual content. The set of sources of bayes tokens is
> configurable with a new configuration option 'bayes_token_sources'
> as documented in the Mail::SpamAssassin::Conf man page. (Bug 7115)
> It is disabled by default for backward compatibility.
> ________________________________________
>
> i am not sure what "backward compatibility" means in this context

Just a precaution. There were some concerns about whether
contributing digests of non-textual parts is beneficial,
and not much experience has been gained yet, so to avoid
any potential surprise the default is the same as with
3.4.0, i.e. digests are not included.

In my experience it can be valuable to include these, and
I haven't seen any ill effect while observing top-10
bayes tokens containing digests, as logged by a debug log,
for several weeks.

> correct me if i am wrong, but IMHO "bayes_token_sources all" should not
> have a side effect when you train a bayes database on SA 3.4.1 and share
> it with a setup using 3.4.0 - the 3.4.0 setup just would not benefit from
> the new mimepart tokens in the database, but still from all others?

That is correct, learned digest tokens as inserted by 3.4.1 are
ignored by 3.4.0 code.

Btw, note that spamd does not process messages larger than some
pre-set size limit. Even if truncated messages are passed to
spamd, it would not see MIME parts beyond the truncation limit.
This is unlike what the current (to-be-released) version of
amavisd does: regardless of mail size amavisd would compute
digests of *all* pristine mail parts, and pass them to SpamAssassin
out-of-band, already ready-to-use, even if a message is truncated.
This also avoids some pre-processing 'corruption' of MIME digests
when computed by SpamAssassin, as a 'pristine' mail as understood
by SpamAssassin is sometimes a little less 'pristine' than ideal,
e.g. due to squashing long runs of empty lines in a message,
and splitting long paragraphs into chunks.
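
A small Python sketch of why such pre-processing matters for digests. The helper `squash_blank_lines` is a hypothetical stand-in, not SpamAssassin's actual normalization; it only shows that collapsing a run of empty lines changes the bytes, and therefore the SHA1 digest, of a part.

```python
import hashlib
import re


def squash_blank_lines(text):
    """Collapse any run of two or more empty lines into a single empty line.

    Hypothetical stand-in for the kind of pre-processing mentioned above.
    """
    return re.sub(r"\n{3,}", "\n\n", text)


def sha1_hex(text):
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


original = "line one\n\n\n\nline two\n"  # three empty lines between the text
squashed = squash_blank_lines(original)  # one empty line remains

if __name__ == "__main__":
    print(sha1_hex(original))
    print(sha1_hex(squashed))
    print(sha1_hex(original) == sha1_hex(squashed))
```

A digest learned from the unmodified part would therefore not match a digest computed over the squashed text, which is the kind of mismatch that computing digests on the pristine message avoids.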

With MIME digests it's the same approach as with DKIM signatures,
which are also pre-computed by amavisd on the complete (non-truncated)
pristine message, and passed to SpamAssassin for use in the DKIM
plugin.

   Mark